<a href="https://colab.research.google.com/github/kevinajordan/Deep-Learning-Projects/blob/master/generating-headlines/generating_headlines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM with Attention - Generating News Headlines

This notebook attempts to implement the model described in this research paper:

[Generating News Headlines with Recurrent Neural Networks](https://arxiv.org/pdf/1512.01712.pdf)

## Bahdanau Attention

For an intuitive explanation of attention and of the approach I follow in this notebook, see this blog post: [Attention in Deep Networks with Keras](https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39)

![Bahdanau Attention](https://miro.medium.com/max/1638/1*wcxAAgQ0n9gOXLRqhmaLGA.png)
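
As a rough illustration of additive (Bahdanau-style) attention, the sketch below computes the attention weights and the context vector for a single decoder step in plain Python. The function name `additive_attention` and the parameters `W_q`, `W_k`, and `v` are illustrative and do not come from this notebook; in a real model they would be learned weights.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_attention(query, keys, W_q, W_k, v):
    """Additive attention for one decoder step.

    query: current decoder state (list of floats)
    keys:  encoder states, one list of floats per input position
    W_q, W_k, v: projection weights (hypothetical names, normally learned)
    Returns (weights, context).
    """
    q_proj = matvec(W_q, query)
    scores = []
    for h in keys:
        k_proj = matvec(W_k, h)
        # score = v . tanh(W_q * query + W_k * key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(v_i * h_i for v_i, h_i in zip(v, hidden)))

    # Softmax over input positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The context vector is the attention-weighted sum of the encoder states.
    dim = len(keys[0])
    context = [sum(w * h[d] for w, h in zip(weights, keys)) for d in range(dim)]
    return weights, context
```

The weights always sum to 1, so the context vector is a weighted average of the encoder states; positions whose projected state aligns with the decoder state receive more weight.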

### News Dataset
The English Gigaword dataset used in the research paper costs $3,000 (https://catalog.ldc.upenn.edu/LDC2003T05).

Since I don't want to spend that much on this project, I am substituting this Kaggle news dataset:

* https://www.kaggle.com/snapcrack/all-the-news/data


Because the Kaggle news dataset is much cleaner than the English Gigaword dataset, the data preprocessing steps differ from those described in the paper.


In [0]:
!git clone https://github.com/kevinajordan/Deep-Learning-Projects.git

Cloning into 'Deep-Learning-Projects'...
remote: Enumerating objects: 54, done.
remote: Total 54 (delta 0), reused 0 (delta 0), pack-reused 54
Unpacking objects: 100% (54/54), done.
Checking out files: 100% (34/34), done.


In [0]:
!ls -la Deep-Learning-Projects/generating-headlines/data

total 1307920
drwxr-xr-x 3 root root      4096 Oct  7 21:09 .
drwxr-xr-x 3 root root      4096 Oct  7 21:03 ..
drwxrwxrwx 2 root root      4096 Oct  7 21:10 all-the-news
-rw-r--r-- 1 root root 669644288 Oct  7 21:07 all-the-news.tar
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.001
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.002
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.003
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.004
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.005
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.006
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.007
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.008
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.009
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.010
-rw-r--r-- 1 root root  26214400 Oct  7 21:03 all-the-news.tar.011
-rw-r--r-- 1 root root  262144

In [0]:
!cat Deep-Learning-Projects/generating-headlines/data/all-the-news.tar.0* > all-the-news.tar

In [0]:
!tar -xvf all-the-news.tar 

all-the-news/
all-the-news/articles1.csv
all-the-news/articles2.csv
all-the-news/articles3.csv


In [0]:
import os 
os.chdir('all-the-news')
print(os.getcwd())

/content/all-the-news


In [0]:
!ls -la /content/all-the-news/

total 653960
drwxrwxrwx 2 root root      4096 Oct  2 14:20 .
drwxr-xr-x 1 root root      4096 Oct  7 21:34 ..
-rwxrwxrwx 1 root root 203539364 Sep 21 03:26 articles1.csv
-rwxrwxrwx 1 root root 225757056 Sep 21 03:26 articles2.csv
-rwxrwxrwx 1 root root 240344348 Sep 21 03:27 articles3.csv


In [0]:
import numpy as np, pandas as pd, re, itertools, collections, nltk, string, random, unidecode

In [0]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
import glob
_files = glob.glob("*.csv")
print(_files)

['articles3.csv', 'articles2.csv', 'articles1.csv']


In [0]:
df_list = []
for filename in sorted(_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('articles.csv')

In [0]:
full_df.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [0]:
full_df.isnull().sum()

Unnamed: 0         0
id                 0
title              2
publication        0
author         15876
date            2641
year            2641
month           2641
url            57011
content            0
dtype: int64

In [0]:
full_df.drop(columns=['author','url'], inplace=True)

In [0]:
print(len(full_df))
full_df.dropna(axis = 0, inplace=True)

In [0]:
len(full_df)

139927

## Data Preprocessing
* The headline and text are converted to lowercase.
* Punctuation is separated from words.
* The headline and text are tokenized.
* An end-of-sequence token is added to both the headline and the text.
* Articles with no headline or text are removed.
* All rare words are replaced with the `<unk>` symbol; only the 40k most frequent words are kept.
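
The steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the code used later in the notebook (which relies on the Keras `Tokenizer`); the helper names `tokenize`, `build_vocab`, and `replace_rare` are made up for this example.

```python
import re
from collections import Counter

EOS, UNK = "<eos>", "<unk>"


def tokenize(text):
    """Lowercase, separate punctuation from words, and append <eos>."""
    text = text.lower()
    # Surround every non-word, non-space character with spaces so that it
    # becomes its own token (note: this also splits apostrophes).
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split() + [EOS]


def build_vocab(token_lists, max_words):
    """Return the set of the max_words most frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word for word, _ in counts.most_common(max_words)}


def replace_rare(tokens, vocab):
    """Replace every token outside the vocabulary with <unk>."""
    return [tok if tok in vocab else UNK for tok in tokens]
```

With `max_words=40000` this would mirror the 40k-word cutoff; in the example the `<eos>` token counts toward that limit like any other token.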

### Dataset Forming
The data is split into a training set and a holdout set. The holdout set consists of articles from the last month of data, and the second-to-last month is excluded from both sets.
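
A minimal sketch of that split, assuming each article is a dict with an ISO-formatted `date` string (such as `2017-06-19`) and that "last month" means the latest calendar month present in the data. The function name `split_by_month` is hypothetical and is not defined elsewhere in this notebook.

```python
from datetime import date


def split_by_month(records):
    """Split records into (train, holdout) by calendar month.

    holdout: records from the latest month present in the data
    train:   records from months before the second-to-last month
    The second-to-last month is dropped to leave a gap between the sets.
    """
    def month_key(record):
        d = date.fromisoformat(record["date"])
        return (d.year, d.month)

    months = sorted({month_key(r) for r in records})
    if len(months) < 3:
        raise ValueError("need at least three distinct months of data")

    holdout_month, gap_month = months[-1], months[-2]
    train = [r for r in records if month_key(r) < gap_month]
    holdout = [r for r in records if month_key(r) == holdout_month]
    return train, holdout
```

With the DataFrame used here, something like `split_by_month(full_df.to_dict("records"))` should work once rows with missing dates have been dropped, provided the `date` values really are in ISO format.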

In [0]:
# converts all characters to lowercase
full_df.title = full_df.title.str.lower()
full_df.content = full_df.content.str.lower()

In [0]:
# get the headline and content values as NumPy arrays
headlines = full_df.title.values
content = full_df.content.values

In [0]:
content[0]

'WASHINGTON — Congressional Republicans have a new fear when it comes to their health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on <eos>'

In [0]:
headlines[0]

'House Republicans Fret About Winning Their Health Care Suit - The New York Times'

In [0]:
for i in range(len(content)):
    # limit each article to its first 50 whitespace-separated words
    desc = content[i].split()
    desc = ' '.join(desc[:50])
    # append the <eos> tag to the truncated article
    content[i] = desc + ' <eos>'

In [0]:
# confirm it works
print(content[0])

WASHINGTON — Congressional Republicans have a new fear when it comes to their health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on <eos>


### Tokenizing the words

In [0]:
# Keep only the 40,000 most frequent words of the vocabulary
import tensorflow as tf
words_limit = 40000
# `filters` lists the characters to strip (punctuation); '<' and '>' are left out
# so that the <eos> and <unk> tags survive tokenization
# `oov_token` replaces every word ranked past num_words with the given tag
# `lower` converts the text to lowercase
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=words_limit,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ', lower=True)
# build the word index from the article text
tokenizer.fit_on_texts(content)
# apply the tokenizer to content: converts every word to its integer index
content = tokenizer.texts_to_sequences(content)

In [0]:
content[0]

[95,
 18,
 708,
 245,
 29,
 3,
 32,
 1523,
 47,
 25,
 412,
 5,
 44,
 211,
 350,
 775,
 108,
 2,
 113,
 177,
 48,
 287,
 405,
 2,
 2238,
 22,
 177,
 112,
 2255,
 5,
 107,
 963,
 2440,
 2,
 241,
 4877,
 108,
 2,
 2155,
 64,
 2337,
 2,
 1359,
 2049,
 5,
 1891,
 3419,
 4,
 1389,
 9,
 7]
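
To make the integer sequences above easier to interpret, here is a simplified pure-Python imitation of how I understand the Keras `Tokenizer` to assign indices: index 1 goes to the out-of-vocabulary token, the remaining words are numbered by descending frequency, and any word whose index is `num_words` or higher is encoded as the out-of-vocabulary token. The function names are hypothetical, and the real class does more (filtering, lowercasing).

```python
from collections import Counter


def fit_word_index(texts, oov_token="<unk>"):
    """Map each word to an integer: the OOV token first, then by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = [oov_token] + [w for w, _ in counts.most_common() if w != oov_token]
    return {word: i for i, word in enumerate(ordered, start=1)}


def encode(text, word_index, num_words, oov_token="<unk>"):
    """Convert text to integers; unknown or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    ids = []
    for word in text.split():
        i = word_index.get(word, oov)
        ids.append(i if i < num_words else oov)
    return ids


def decode(ids, word_index):
    """Convert integers back to a space-separated string."""
    index_word = {i: w for w, i in word_index.items()}
    return " ".join(index_word[i] for i in ids)
```

Decoding a sequence this way is a quick check that truncation and the `<eos>` tag survived tokenization; with the real tokenizer, `tokenizer.sequences_to_texts` serves the same purpose.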

## WIP: Using Pre-Trained Word Embeddings - Word2Vec

In [0]:
!pip install unidecode

In [0]:
# Download the pre-trained GoogleNews word2vec vectors
import gensim
from gensim import models
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

In [0]:
embedding_dimension = 300       # the GoogleNews vectors are 300-dimensional
top_freq_word_to_use = 40000    # same cutoff as words_limit above
# indices 0-2 are reserved for the special tokens
word2idx = {'<empty>': 0, '<eos>': 1, '<unk>': 2}
idx2word = {idx: word for word, idx in word2idx.items()}

def read_word_embedding(file_name):
    """
    Read a word2vec binary file and assign an index to each word.
    Returns a dict mapping index -> vector.
    """
    temp_word2vec_dict = {}
    # <empty>, <eos> and <unk> have no pre-trained vector: use random ones
    for idx in range(3):
        temp_word2vec_dict[idx] = np.random.rand(embedding_dimension)
    model = gensim.models.KeyedVectors.load_word2vec_format(
        file_name, binary=True, limit=top_freq_word_to_use)
    # gensim >= 4.0 names the vocabulary list index_to_key; older versions use index2word
    vocab = getattr(model, 'index_to_key', None) or model.index2word
    for idx, word in enumerate(vocab, start=3):
        temp_word2vec_dict[idx] = model[word]
        word2idx[word] = idx
        idx2word[idx] = word
        if idx % 10000 == 0:
            print("working on word2vec ... idx ", idx)
    return temp_word2vec_dict

In [0]:
temp_word2vec_dict = read_word_embedding('GoogleNews-vectors-negative300.bin.gz')
length_vocab = len(temp_word2vec_dict)
shape = (length_vocab, embedding_dimension)
# fast random initialization; rows without a pre-trained vector keep these values
word2vec = np.random.uniform(low=-1, high=1, size=shape)
for i in range(length_vocab):
    if i in temp_word2vec_dict:
        word2vec[i, :] = temp_word2vec_dict[i]
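
One thing to watch in this work-in-progress section: the indices assigned in `read_word_embedding` follow the order of the word2vec file, while the integer sequences produced earlier follow the tokenizer's own word index, so the two numberings will generally disagree. The sketch below shows one possible way to build a matrix aligned with the tokenizer instead. It uses plain lists for clarity, and the name `build_embedding_matrix` and its arguments are assumptions for this example, not part of the notebook.

```python
import random


def build_embedding_matrix(word_index, pretrained, dim, num_words, seed=0):
    """Build a num_words x dim matrix whose row i holds the vector for index i.

    word_index: word -> integer index (e.g. tokenizer.word_index)
    pretrained: word -> vector lookup (e.g. loaded word2vec vectors)
    Rows without a pre-trained vector keep a random initialization.
    Returns (matrix, number_of_words_found).
    """
    rng = random.Random(seed)
    matrix = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
              for _ in range(num_words)]
    hits = 0
    for word, i in word_index.items():
        # Skip words ranked past the vocabulary cutoff.
        if i < num_words and word in pretrained:
            matrix[i] = list(pretrained[word])
            hits += 1
    return matrix, hits
```

The returned hit count gives a quick sense of how much of the vocabulary the pre-trained vectors cover. Keep in mind that the articles are lowercased here, so cased entries in the pre-trained vocabulary would not match.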

In [0]:
all_characters = string.printable
n_characters = len(all_characters)

file = unidecode.unidecode(dictionary)
file_len = len(file)
print('file_len =', file_len)

file_len = 4778956


In [0]:
chunk_len = 10000

def random_chunk():
    start_index = random.randint(0, file_len - chunk_len)
    end_index = start_index + chunk_len + 1
    return file[start_index:end_index]

## Create LSTMs for Text Generation


In [0]:
EMBEDDING_DIM = 512

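def lstm_model(seq_len=100, batch_size=384, stateful=True):
  """Character-level language model: predict the next character given the current one.

  The vocabulary size of 256 covers single-byte characters, not the 40k-word vocabulary above.
  """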
  source = tf.keras.Input(
      name='seed', shape=(seq_len,), batch_size=batch_size, dtype=tf.int32)

  embedding = tf.keras.layers.Embedding(input_dim=256, output_dim=EMBEDDING_DIM)(source)
  lstm_1 = tf.keras.layers.LSTM(EMBEDDING_DIM, stateful=stateful, return_sequences=True)(embedding)
  lstm_2 = tf.keras.layers.LSTM(EMBEDDING_DIM, stateful=stateful, return_sequences=True)(lstm_1)
  lstm_3 = tf.keras.layers.LSTM(EMBEDDING_DIM, stateful=stateful, return_sequences=True)(lstm_2)
  predicted_char = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(256, activation='softmax'))(lstm_3)
  return tf.keras.Model(inputs=[source], outputs=[predicted_char])

In [0]:
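# Sketch of the encoder-decoder with attention; layer sizes are placeholders
VOCAB_SIZE, HIDDEN_DIM = 40000, 512
# Encoding layers
encoder_inputs = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN_DIM)(encoder_inputs)
x = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(x)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(
    HIDDEN_DIM, return_sequences=True, return_state=True)(x)

# Decoding layers
decoder_inputs = tf.keras.Input(shape=(None,), dtype=tf.int32)
y = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN_DIM)(decoder_inputs)
y = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(
    y, initial_state=[state_h, state_c])
y = tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)(y)
# Attention expects [query, value]: decoder states attend over the encoder outputs.
# tf.keras.layers.Attention is dot-product; AdditiveAttention is the Bahdanau-style layer.
context = tf.keras.layers.Attention()([y, encoder_outputs])
y = tf.keras.layers.Concatenate()([y, context])
outputs = tf.keras.layers.Dense(VOCAB_SIZE, activation='softmax')(y)
seq2seq_model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)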

In [0]:
import random, codecs, math, time, sys, subprocess, os.path, pickle
import numpy as np, pandas as pd 
import gensim
from numpy import inf
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from nltk.translate.bleu_score import sentence_bleu
