## Deep learning homework - Milestone 1
#####Topic: NLP - text summarization
#####Authors: Bence Halasz, Patrick Nanys, Mate Jakab (Goal Diggers)

###1. Data load from xlsx 

> Source: https://www.kaggle.com/shashichander009/inshorts-news-data



In [1]:
import pandas as pd

xls = pd.read_excel("Inshorts Cleaned Data.xlsx")
# Load articles, stories
input_raw = xls['Short']
# Load headlines for articles and stories
output_raw = xls['Headline']

# Show example
print(input_raw.head(), output_raw.head())

0    The CBI on Saturday booked four former officia...
1    Chief Justice JS Khehar has said the Supreme C...
2    At least three people were killed, including a...
3    Mukesh Ambani-led Reliance Industries (RIL) wa...
4    TV news anchor Arnab Goswami has said he was t...
Name: Short, dtype: object 0    4 ex-bank officials booked for cheating bank o...
1       Supreme Court to go paperless in 6 months: CJI
2    At least 3 killed, 30 injured in blast in Sylh...
3    Why has Reliance been barred from trading in f...
4    Was stopped from entering my own studio at Tim...
Name: Headline, dtype: object


###2. Taking out stopwords and shorten words (optional)
> Shortening options: lemmatizing/stemmimg



In [3]:
# nltk library and used modules
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer 
from nltk.corpus import wordnet

# Lemmatize with POS Tag (a given word gets the right POS tag)
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
print('Lemmatizing example')

words = ["dogs", "are", "better", "than", "cats"] 
for w in words:
    print(w, "\t->", WordNetLemmatizer().lemmatize(w, get_wordnet_pos(w)))

print('\nStemming example')

words = ["run", "ran", "runner", "running"] 
for w in words: 
    print(w, "->", PorterStemmer().stem(w)) 

Lemmatizing example
dogs 	-> dog
are 	-> be
better 	-> well
than 	-> than
cats 	-> cat

Stemming example
run -> run
ran -> ran
runner -> runner
running -> run


In [30]:
def takeoutstopwords_and_shorten(data, mode='none'):
    data_nltk = []

    # nltk's built in lemmatizer, stemmer and detokenizer
    lemmatizer = WordNetLemmatizer()  
    ps = PorterStemmer()
    detok = TreebankWordDetokenizer()

    # Cycle through data (range set to 10 instead of data.size until we use this)
    for i in range(data.size):  #range(data.size)
        # Tokenize
        tokenized = word_tokenize(data[i])
        # Take out stopwords
        without_stopword = [word for word in tokenized if not word in stopwords.words('english')]
        # Shorten if parameter 'mode' is 'lemmatize' or 'stem'
        shorter = []
        for word in without_stopword: 
            if mode == 'lemmatize':
                shorter.append(lemmatizer.lemmatize(word, get_wordnet_pos(word)))
            elif mode == 'stem':
                shorter.append(ps.stem(word))
            else:
                shorter.append(word)
        # Detokenize
        back_to_string = detok.detokenize(shorter)

        data_nltk.append(back_to_string)

    return data_nltk

In [None]:
# Take out stopwords and shorten input_raw data with lemmatize and stem
input_nltk_lemma = takeoutstopwords_and_shorten(input_raw, mode='lemmatize')
input_nltk_stem = takeoutstopwords_and_shorten(input_raw, mode='stem')

#Maybe it is useless to take out stopwords from output value (it is already short)
#output_nltk = takeoutstopwords_and_shorten(output_raw)

i = 2
print('Example\n', input_raw[i], '\n', input_nltk_lemma[i], '\n', input_nltk_stem[i])

#Not connected to tokenization yet

Example
 At least three people were killed, including a policeman, while 30 others were wounded on Saturday evening in two explosions in Sylhet, Bangladesh. The explosions were targetted at people and police officials who were witnessing an over 30-hour-long gunfight between extremists and commandos. Earlier on Friday, a man had blown himself up in front of a checkpoint near Dhaka Airport. 
 At least three people kill, include policeman , 30 others wound Saturday even two explosion Sylhet, Bangladesh . The explosion targetted people police official witness 30-hour-long gunfight extremist commando . Earlier Friday, man blown front checkpoint near Dhaka Airport. 
 At least three peopl kill, includ policeman , 30 other wound saturday even two explos sylhet, bangladesh . the explos target peopl polic offici wit 30-hour-long gunfight extremist commando . earlier friday, man blown front checkpoint near dhaka airport.


###3. Tokenization and padding

##### Obtaining length data

In [6]:
input_lengths = pd.Series([len(x) for x in input_raw])
output_lengths = pd.Series([len(x) for x in output_raw])
print('Inputs:\n', input_lengths.describe())
print('Outputs:\n', output_lengths.describe())

Inputs:
 count    55104.000000
mean       362.374129
std         24.985897
min        280.000000
25%        346.000000
50%        363.000000
75%        381.000000
max        448.000000
dtype: float64
Outputs:
 count    55104.000000
mean        50.016188
std          6.461184
min          8.000000
25%         46.000000
50%         48.000000
75%         56.000000
max         71.000000
dtype: float64


In [8]:
# maxlen
# taking values > and round figured to 75th percentile
# at the same time not leaving high variance
encoder_maxlen = 400
decoder_maxlen = 75

Adding start and stop signs

In [9]:
# should be applied to output
def apply_start_end(data):
  return data.apply(lambda x: '<START> ' + x + ' <END>')

#### Tokenization

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer

def tokenize(inputs, outputs):
  # since < and > from default tokens cannot be removed
  filters = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'
  oov_token = '<UNK>'

  input_tokenizer = Tokenizer(oov_token=oov_token)
  output_tokenizer = Tokenizer(filters=filters, oov_token=oov_token)

  input_tokenizer.fit_on_texts(inputs)
  output_tokenizer.fit_on_texts(outputs)

  tokenized_inputs = input_tokenizer.texts_to_sequences(inputs)
  tokenized_outputs = output_tokenizer.texts_to_sequences(outputs)

  return tokenized_inputs, tokenized_outputs

#### Padding

In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def add_padding(inputs, outputs, encoder_maxlen, decoder_maxlen):
  padded_inputs = pad_sequences(inputs, maxlen=encoder_maxlen, padding='post', truncating='post')
  padded_outputs = pad_sequences(outputs, maxlen=decoder_maxlen, padding='post', truncating='post')
  return padded_inputs, padded_outputs

###4. Split dataset into train, validation and test datas

In [13]:
def dataset_split(X, Y, valid_split, test_split):
  v_start = int(len(X)*(1-valid_split-test_split))
  t_start = int(len(X)*(1-test_split))
  X_train, Y_train = X[:v_start], Y[:v_start]
  X_valid, Y_valid = X[v_start:t_start], Y[v_start:t_start]
  X_test , Y_test  = X[t_start:], Y[t_start:]
  return X_train, Y_train, X_valid, Y_valid, X_test, Y_test

###5. Testing

In [33]:
%%time

# Example data preprocessing

# replace unreadable part
input_raw = input_raw.replace(["&#39;"], ["'"], regex=True)
output_raw = output_raw.replace(["&#39;"], ["'"], regex=True)

start_end_output = apply_start_end(output_raw)
lemmatized_inputs = takeoutstopwords_and_shorten(input_raw, mode='lemmatize')
tokenized_inputs, tokenized_outputs = tokenize(lemmatized_inputs, start_end_output)
X, Y = add_padding(tokenized_inputs, tokenized_outputs, encoder_maxlen, decoder_maxlen)
X_train, Y_train, X_valid, Y_valid, X_test, Y_test = dataset_split(X, Y, valid_split=0.2, test_split=0.1)

CPU times: user 10min, sys: 47.5 s, total: 10min 47s
Wall time: 10min 48s


In [34]:
X_train.shape, Y_train.shape, X_valid.shape, Y_valid.shape, X_test.shape, Y_test.shape

((38572, 400), (38572, 75), (11021, 400), (11021, 75), (5511, 400), (5511, 75))

In [35]:
X_train.shape[0] + X_valid.shape[0] + X_test.shape[0]

55104