# Hey there, welcome to your final challenge

You're finally going to have a chance to apply the skills you've been learning on a real-life dataset.

Your task is to predict if a given news headline is sarcastic or not.

<img src="images/you_dont_say.jpg" width="400">

--- 

A good way to think about how a machine might detect sarcasm is to try it yourself.
Which of these headlines is sarcastic?

* "Thinking about the way you look all the time burns 5,000 calories an hour"
* "Safeguarding the well-being of children"

How did you make your decision? Which features of the text gave it away?
This is the thought process you will need to make a good model.

--- 

Feel free to be creative and write your own code wherever you want!

The provided functions are only there to help you if you get stuck :)

### Imports

In [7]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from gensim.models.keyedvectors import Word2VecKeyedVectors
from nltk.corpus import stopwords
import string
import nltk
nltk.download("stopwords")



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ollie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [36]:
import statsmodels.api as sm


### Loading the data

In [2]:
# Load in the data from CSV
fp = '../data/train_df.csv'
df = pd.read_csv(fp)

# Print the number of headlines that we're dealing with
print(f'Size of data: {len(df)}')

# Print the percentage of headlines that are sarcastic
sarcasm_percentage = int(100 * sum(df['is_sarcastic']) / len(df))
print(f'Sarcasm percentage: {sarcasm_percentage}%')

# Show a sample of the data
df.head()

Size of data: 15862
Sarcasm percentage: 43%


Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


### Cleaning the data

In [3]:
stop_words = stopwords.words('english')

def basic_tokenizer(sentence):
    out = sentence
    for x in string.punctuation:
        out = out.replace(x, '')
    out = out.lower().split()
    return out

def remove_stopwords(sentence, stop_words):
    return [word for word in sentence if word not in stop_words]

def preprocess(sentence, stop_words):
    clean_sentence = remove_stopwords(basic_tokenizer(sentence), stop_words)
    return clean_sentence

df['clean_head'] = df['headline'].apply(lambda x: preprocess(x, stop_words))

df.head()

Unnamed: 0,article_link,headline,is_sarcastic,clean_head
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0,"[former, versace, store, clerk, sues, secret, ..."
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0,"[roseanne, revival, catches, thorny, political..."
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1,"[mom, starting, fear, sons, web, series, close..."
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1,"[boehner, wants, wife, listen, come, alternati..."
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0,"[jk, rowling, wishes, snape, happy, birthday, ..."


### Features

The first step we have to take in a machine learning task is to create features.

Below, our features are the counts of each vocabulary word in each headline.
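As a quick illustration of what a word-count feature looks like, here is a minimal sketch on made-up data. The names `count_words`, `toy_vocab` and `toy_headline` are hypothetical and exist only for this example.

```python
from collections import Counter

def count_words(tokens, vocab):
    """Count how often each vocabulary word occurs in a tokenized headline."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur
    return [counts[w] for w in vocab]

# Hypothetical toy vocabulary and headline, for illustration only
toy_vocab = ["area", "man", "nation", "trump"]
toy_headline = ["area", "man", "tells", "area", "dog", "about", "nation"]

toy_bow = count_words(toy_headline, toy_vocab)
print(toy_bow)  # [2, 1, 1, 0]
```

Each position in the output lines up with one word in the vocabulary, so the feature vector is as long as the vocabulary itself.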

In [4]:
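from collections import Counter

def sentence_to_bow(sentence, vocab):
    # Count each token once, then look up every vocabulary word
    counts = Counter(sentence)
    return [counts[w] for w in vocab]

# Build the vocabulary in order of first appearance
#vocab = ['trump', 'nation', 'area', 'onion']
vocab = []
seen = set()
for item in df['clean_head']:
    for word in item:
        if word not in seen:
            seen.add(word)
            vocab.append(word)
df['bow'] = df['clean_head'].apply(lambda x: sentence_to_bow(x, vocab))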

df.head()

Unnamed: 0,article_link,headline,is_sarcastic,clean_head,bow
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0,"[former, versace, store, clerk, sues, secret, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ..."
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0,"[roseanne, revival, catches, thorny, political...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ..."
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1,"[mom, starting, fear, sons, web, series, close...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1,"[boehner, wants, wife, listen, come, alternati...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0,"[jk, rowling, wishes, snape, happy, birthday, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [5]:
w2vmodel = Word2VecKeyedVectors.load("./models/word2vec.model")
w2vmodel.init_sims(replace=True)



### Train and test

Once we've created our features, we need to split our data into train and test datasets.
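A minimal sketch of such a split on made-up data is shown here. The arrays `X_toy` and `y_toy` are hypothetical, and the `random_state` and `stratify` arguments are optional extras that this notebook does not otherwise use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# random_state makes the shuffle reproducible;
# stratify keeps the label balance the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```

The model only ever sees the training part, so the score on the held-out part is an estimate of how it would do on new headlines.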



In [6]:
vecs = []
for item in df['clean_head']:
    # Sum the 300-dimensional word vectors of every in-vocabulary word
    total = np.zeros(300)
    for word in item:
        try:
            total += w2vmodel[word]
        except KeyError:
            print("word {} not in vocab".format(word))
    vecs.append(total)

df['vecs'] = vecs
df.head()
    


word clif not in vocab
word clif not in vocab
word hasnt not in vocab
word aoki not in vocab
word 12 not in vocab
word 12 not in vocab
word chaffetz not in vocab
word antitrump not in vocab
word blackrubberclad not in vocab
word ryans not in vocab
word worldweary not in vocab
word 3ring not in vocab
word romneys not in vocab
word 60 not in vocab
word heimlich not in vocab
word ruffalo not in vocab
word retrocrazed not in vocab
word sciencefiction not in vocab
word 20s not in vocab
word it not in vocab
word wasnt not in vocab
word corbyn not in vocab
word 10 not in vocab
word 2000 not in vocab
word freerange not in vocab
word hmo not in vocab
word stouffers not in vocab
word eichner not in vocab
word moines not in vocab
word perate not in vocab
word 60 not in vocab
word 13 not in vocab
word hemsworth not in vocab
word instagram not in vocab
word dobrev not in vocab
word gingrich not in vocab
word shouldnt not in vocab
word kaepernick not in vocab
word tripledecker not in vocab
...

Unnamed: 0,article_link,headline,is_sarcastic,clean_head,bow,vecs
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0,"[former, versace, store, clerk, sues, secret, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...","[0.32401923555880785, 0.28006773302331567, -0...."
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0,"[roseanne, revival, catches, thorny, political...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...","[0.1960348472930491, 0.1813446283340454, -0.06..."
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1,"[mom, starting, fear, sons, web, series, close...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.14253034256398678, -0.057082670740783215, 0..."
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1,"[boehner, wants, wife, listen, come, alternati...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.09356081765145063, -0.011456608772277832, 0..."
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0,"[jk, rowling, wishes, snape, happy, birthday, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.07492709212237969, 0.2086553107947111, 0.13..."


In [24]:
linear_regression = LinearRegression()
y = df['is_sarcastic'].to_numpy()
# Each cell of 'vecs' and 'bow' holds a whole sequence, so expand both columns
# into 2-D arrays and stack them side by side (one row per headline).
# Note: the bag-of-words block is as wide as the vocabulary, so this is memory-hungry.
x = np.hstack([
    np.array(df['vecs'].tolist(), dtype=np.float32),
    np.array(df['bow'].tolist(), dtype=np.float32),
])
print(x.shape)
linear_regression.fit(x, y)

In [None]:
y_pred = linear_regression.predict(x)
print(y_pred)

In [18]:
# Stack the word counts and the summed embeddings side by side: one row per headline
X = np.hstack([
    np.array(df['bow'].tolist(), dtype=np.float32),
    np.array(df['vecs'].tolist(), dtype=np.float32),
])
print(X.shape)
y = df['is_sarcastic'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [None]:
lr_1 = sm.OLS(y_train, X_train).fit()
lr_1.summary()

### Logistic Regression

Suppose we have a feature $x$ with a target output $y$.

For our use case $x$ could be a particular word, and $y$ is whether the headline is sarcastic or not.

---

In school you will (hopefully) have learned about *linear regression*,
where the goal is to learn coefficients $a$ and $b$
that minimize the squared error, summed over all training examples,

$err = \sum (y_* - y)^2$

where $y_* = ax + b$

This is a good method when the $y_*$ are unbounded, 
but we need our outputs to be probabilities,
so they have to lie in the closed interval $[0, 1]$.
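To see why unbounded outputs are a problem, here is a small sketch that fits a straight line to made-up binary labels. The data in `x_line` and `y_line` is hypothetical, and `np.polyfit` is used only as a convenient least-squares fitter.

```python
import numpy as np

# Hypothetical toy data: one feature and a binary label
x_line = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_line = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Least-squares fit of y_* = a*x + b
a, b = np.polyfit(x_line, y_line, deg=1)

# Predictions for inputs far outside the training range
y_high = a * 10.0 + b
y_low = a * -5.0 + b
print(y_high, y_low)
```

The first prediction comes out above 1 and the second below 0, so neither can be read as a probability.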

---

We can achieve this with *logistic regression*.

This is an adaptation of linear regression where we push $y_*$ through a *logistic function*

$\sigma(y_*) = \frac{1}{1 + e^{-y_*}}$

Exercise: Check that for any real-valued input $y_*$, the output $\sigma(y_*)$ lies in $[0, 1]$.

---

Above we only have one feature, but regression can be generalized to work for multiple features. 
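With several features, the single product $ax$ becomes a weighted sum over all features, which is then passed through the logistic function. The sketch below uses made-up numbers: the weights `w`, bias `b` and feature vector `x_vec` are hypothetical, not values learned from the headlines.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vector (three features)
w = np.array([0.8, -1.2, 0.3])
b = 0.1
x_vec = np.array([1.0, 0.0, 2.0])

# Linear score y_* = w . x + b, then turned into a probability
score = w @ x_vec + b
prob = sigmoid(score)
print(score, prob)

# Numerical check over a range of inputs
zs = np.linspace(-10, 10, 201)
print(sigmoid(zs).min(), sigmoid(zs).max())
```

A probability above 0.5 would be read as "sarcastic", which is the same as the linear score being positive.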

We will be borrowing an implementation of logistic regression from Sklearn.

It also has other classification models which you can explore [here](https://scikit-learn.org/stable/supervised_learning.html).

This model gives us a baseline of about 60% accuracy on the test set.

By changing the features you should be able to achieve over 90% accuracy.

We believe in you!

In [11]:
def fit_lr(X_train, y_train):
    # Train a logistic regression model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # Get the accuracy of the model on the train and the test data
    # (X_test and y_test are read from the notebook's global scope)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    print(f'Train accuracy: {int(100*train_score)}%')
    print(f'Test accuracy: {int(100*test_score)}%')

    return model

model = fit_lr(X_train, y_train)

### Alternative features

Instead of using word counts, we can use the mean word embeddings of headlines as features.

This gives us about 70% accuracy on the test set.

In [19]:
w2vmodel = Word2VecKeyedVectors.load("./models/word2vec.model")
w2vmodel.init_sims(replace=True)

In [20]:
def get_mean_vec(head):
    vecs = [w2vmodel[w] for w in head if w in w2vmodel]
    
    if vecs:
        mean_vec = np.array(np.mean(vecs, axis=0), dtype=np.float32)
    else:
        mean_vec = np.zeros(300, dtype=np.float32)
    
    return mean_vec

In [21]:
df['mean_vec'] = df['clean_head'].apply(lambda x: get_mean_vec(x))

In [23]:
X = np.array(list(df['mean_vec']), dtype=np.float32)
y = df['is_sarcastic'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

[[ 0.03240193  0.02800677 -0.01525985 ... -0.02224203  0.02556937
   0.01908811]
 [ 0.02800498  0.02590637 -0.00986978 ... -0.00657245  0.01931108
   0.04447164]
 [ 0.0158367  -0.00634252  0.01694649 ... -0.04116534  0.00284461
   0.0050803 ]
 ...
 [ 0.04946033  0.04542911 -0.01374706 ... -0.05818217  0.01068554
  -0.02426743]
 [ 0.02833284 -0.01973758 -0.02485151 ...  0.00610715  0.01327639
  -0.01566681]
 [ 0.0231804   0.02762462  0.01079484 ... -0.03451134 -0.00932075
  -0.00485594]]


In [75]:
model = fit_lr(X_train, y_train)

Train accuracy: 74%
Test accuracy: 74%


### Debugging

To improve our model we'll need to examine where it's going wrong.

Let's have a look at some of the headlines where it's failing.

In [41]:
def debug_output(y_pred, y_test, df, n_samples=5):
    """
    Debug the output of your model
    
    :param y_pred: np.ndarray[bool]
                 : Predictions of the sarcasm of X_test
    :param y_test: np.ndarray[bool]
                 : The true sarcasm of X_test
    :param df: pd.core.frame.DataFrame
             : The sarcasm data
    :param n_samples: int (default = 5)
                    : The number of samples to debug
    """
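    # NOTE: this assumes the test set is the last len(y_pred) rows of df.
    # That only holds if the split was made without shuffling
    # (train_test_split shuffles by default).
    n_train = len(df) - len(y_pred)
    test_headlines = df['headline'].to_numpy()[n_train:]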
    
    # Find the predictions which were wrong
    failed_idx = np.where(y_pred != y_test)[0]
    
    # Display a sample of the headlines where our model failed
    for idx in failed_idx[:n_samples]:
        print(f'Idx: {idx}')
        print(f'Prediction: {y_pred[idx]}')
        print(f'Actual: {y_test[idx]}')
        print(f'Headline: {test_headlines[idx]}')
        print('')
    
y_pred = model.predict(X_test)
debug_output(y_pred, y_test, df, n_samples=10)

Idx: 1
Prediction: 0
Actual: 1
Headline: man claims ex cares more about 'nonexistent' singing career than their daughter

Idx: 2
Prediction: 0
Actual: 1
Headline: feds give $43 million to fast track development of ebola vaccines

Idx: 3
Prediction: 0
Actual: 1
Headline: donald trump helped spread birtherism. now he can't stop it.

Idx: 11
Prediction: 0
Actual: 1
Headline: maine leading the way on government of, for and by the people

Idx: 17
Prediction: 0
Actual: 1
Headline: rare pokemon sparks massive stampede in taiwan

Idx: 18
Prediction: 0
Actual: 1
Headline: what's damaging about fat-shaming: the last acceptable form of bias

Idx: 29
Prediction: 1
Actual: 0
Headline: crisis and context for virgin galactic

Idx: 39
Prediction: 1
Actual: 0
Headline: want to sleep in 'the world's largest grave'? airbnb to the rescue

Idx: 42
Prediction: 0
Actual: 1
Headline: police say conditions too nippy to rescue missing hiker

Idx: 50
Prediction: 0
Actual: 1
Headline: donald trump's supreme court
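One thing to watch: `train_test_split` shuffles rows by default, so the last `len(y_pred)` rows of `df` are not necessarily the headlines the model was tested on. A possible way around this, sketched here on made-up data, is to split an array of row positions together with the features. The names `toy_df`, `X_toy`, `idx_tr` and `idx_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for `df`
toy_df = pd.DataFrame({
    "headline": [f"headline {i}" for i in range(10)],
    "is_sarcastic": [0, 1] * 5,
})
X_toy = np.arange(20, dtype=np.float32).reshape(10, 2)
y_toy = toy_df["is_sarcastic"].to_numpy()

# Split the row positions alongside the features,
# so every test row can be traced back to its headline
idx = np.arange(len(toy_df))
X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    X_toy, y_toy, idx, test_size=0.2, random_state=0
)

test_headlines = toy_df["headline"].to_numpy()[idx_te]
print(list(zip(test_headlines, y_te)))
```

With the positions in hand, the headline printed next to each prediction is the one that prediction was made for.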

### Further debugging

To improve our model it is important to examine how it's making predictions.
We can do this by looking at the importance that it places on each word.

If the coefficient is positive, the word is used mostly in sarcastic headlines.

If the coefficient is negative, the word is used mostly in non-sarcastic headlines.

---

Could this be useful for detecting which features are important?

Could this be useful for detecting overfitting?

--- 

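Note: this only works when the model was trained on the bag-of-words features, so that each coefficient corresponds to one word in `vocab`. With embedding features the coefficients do not line up with `vocab`, and the call below cannot produce meaningful output.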
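To make the link between coefficients and words concrete, here is a minimal sketch that trains on a tiny made-up bag-of-words matrix. The vocabulary, counts and labels are hypothetical and chosen so that two words appear only in sarcastic rows and two only in non-sarcastic rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy vocabulary and bag-of-words counts, for illustration only
toy_vocab = ["area", "report", "study", "nation"]
X_toy = np.array([
    [1, 0, 0, 1],   # sarcastic
    [1, 0, 0, 0],   # sarcastic
    [0, 0, 0, 1],   # sarcastic
    [0, 1, 1, 0],   # not sarcastic
    [0, 1, 0, 0],   # not sarcastic
    [0, 0, 1, 0],   # not sarcastic
])
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_model = LogisticRegression().fit(X_toy, y_toy)
toy_coef = toy_model.coef_[0]

# One coefficient per vocabulary word, so the two can be zipped together
for word, c in sorted(zip(toy_vocab, toy_coef), key=lambda p: -abs(p[1])):
    print(f"{word}: {c:+.2f}")
```

The check that `len(toy_coef) == len(toy_vocab)` is the one worth making before reading any real coefficients this way.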

In [42]:
def print_important_words(coef, vocab, n_words=10):
    """
    Print the most important words according to the model coefficients
    
    :param coef: list[float]
               : The importance of each coefficient
    :param vocab: list[str]
                : The vocabulary corresponding to the coefficients
    :param n_words: int (default = 10)
                  : The number of top words to show
    """
    # Put the coefficients in order from biggest magnitude to smallest magnitude
    top_idx = np.argsort(-abs(coef))
    
    # Take the top `n_words` coefficients
    top_idx = top_idx[:n_words]
    
    for i in top_idx:
        print(f'{vocab[i]}: {coef[i]:.2f}')

coef = model.coef_[0]
print_important_words(coef, vocab)

IndexError: list index out of range

### Headline examination

If we're interested in a particular word, we can print the headlines that contain it.

This will help us to select new features and gain a better understanding of the current features.
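If only the counts are needed, a boolean mask can do the same lookup without an explicit loop. This is a sketch on a made-up frame: `toy_df` and its headlines are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`
toy_df = pd.DataFrame({
    "headline": [
        "trump vows to win",
        "area man wins",
        "the trumpet solo",
        "report: trump tweets",
    ],
    "is_sarcastic": [0, 1, 0, 1],
})

# Whole-word match: split each headline on whitespace and test membership,
# so "trumpet" does not count as "trump"
has_word = toy_df["headline"].apply(lambda h: "trump" in h.split())

# Number of matching headlines per label
counts = toy_df.loc[has_word, "is_sarcastic"].value_counts()
print(counts)
```

Comparing the two counts gives a rough sense of whether the word leans sarcastic or not before it is added as a feature.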

In [45]:
def print_headlines_with_word(df, word):
    """
    Find the headlines which contain the given word
    
    :param df: pd.core.frame.DataFrame
             : The sarcasm data
    :param word: str
               : The word of interest
    """
    
    sarcasm = 0
    not_sarcasm = 0
    
    for idx, row in df.iterrows():
        
        # Check if the word is in the headline
        if word in row['headline'].split(' '):
            
            print(f'Label: {row["is_sarcastic"]}')
            print(f'Headline: {row["headline"]}\n')
            
            if row['is_sarcastic']:
                sarcasm += 1
            else:
                not_sarcasm += 1
                
    print(f'Sarcasm: {sarcasm}')
    print(f'Not sarcasm: {not_sarcasm}')
            
print_headlines_with_word(df, 'trump')

 trump voters say they oppose key elements of gop obamacare replacement

Label: 0
Headline: sunday show hosts hit back on trump administration's lies

Label: 0
Headline: new ad hammers trump as too impulsive to allow near the nuclear button

Label: 0
Headline: donald trump vows to take travel ban to the supreme court

Label: 1
Headline: trump struck by beautiful vision of what america could be while looking out over seething, screaming arizona crowd

Label: 1
Headline: trump surrogate enjoying thrill of not knowing what she going to be defending minute to minute

Label: 1
Headline: ​report: all standing between trump and presidency is nation that made him billionaire celebrity

Label: 0
Headline: seth meyers has a not-so-subtle message for donald trump

Label: 0
Headline: trump replaces ice chief daniel ragsdale, appoints thomas homan

Label: 0
Headline: carly fiorina scores well on social media in face-off with trump

Label: 0
Headline: will the trump administration ever acknowledge c