In [45]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

%matplotlib inline

import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

import time
from numba import jit, njit

# Import dataset

In [100]:
df_origin = pd.read_csv('../data/wikihowAll.csv', delimiter=',', nrows=1000)

In [101]:
df_origin.head()

Unnamed: 0,headline,title,text
0,"\nKeep related supplies in the same area.,\nMa...",How to Be an Organized Artist1,"If you're a photographer, keep all the necess..."
1,\nCreate a sketch in the NeoPopRealist manner ...,How to Create a Neopoprealist Art Work,See the image for how this drawing develops s...
2,"\nGet a bachelor’s degree.,\nEnroll in a studi...",How to Be a Visual Effects Artist1,It is possible to become a VFX artist without...
3,\nStart with some experience or interest in ar...,How to Become an Art Investor,The best art investors do their research on t...
4,"\nKeep your reference materials, sketches, art...",How to Be an Organized Artist2,"As you start planning for a project or work, ..."


In [102]:
print('headline:\n' + df_origin['headline'][0])

headline:

Keep related supplies in the same area.,
Make an effort to clean a dedicated workspace after every session.,
Place loose supplies in large, clearly visible containers.,
Use clotheslines and clips to hang sketches, photos, and reference material.,
Use every inch of the room for storage, especially vertical space.,
Use chalkboard paint to make space for drafting ideas right on the walls.,
Purchase a label maker to make your organization strategy semi-permanent.,
Make a habit of throwing out old, excess, or useless stuff each month.


In [103]:
print('title:\n\n' + df_origin['title'][0])

title:

How to Be an Organized Artist1


In [104]:
print('text:\n\n' + df_origin['text'][0])

text:

 If you're a photographer, keep all the necessary lens, cords, and batteries in the same quadrant of your home or studio. Paints should be kept with brushes, cleaner, and canvas, print supplies should be by the ink, etc. Make broader groups and areas for your supplies to make finding them easier, limiting your search to a much smaller area. Some ideas include:


Essential supplies area -- the things you use every day.
Inspiration and reference area.
Dedicated work area .
Infrequent or secondary supplies area, tucked out of the way.;
, This doesn't mean cleaning the entire studio, it just means keeping the area immediately around the desk, easel, pottery wheel, etc. clean each night. Discard trash or unnecessary materials and wipe down dirty surfaces. Endeavor to leave the workspace in a way that you can sit down the next day and start working immediately, without having to do any work or tidying.


Even if the rest of your studio is a bit disorganized, an organized workspace wil

### As we can see, the 'headline' column in this dataset can be treated as a summary of the 'text' column. In this notebook I will try to build a model using only the 'text' column, and then compare its output to the 'headline' column using different metrics.
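One way to make the comparison against 'headline' concrete is a unigram-overlap score in the spirit of ROUGE-1. The notebook does not say which metrics it will use, so the function name `unigram_overlap` and the whitespace tokenization below are assumptions. This is a minimal sketch, not the evaluation the notebook performs.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) over whitespace-separated unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = unigram_overlap("keep supplies in one area",
                          "keep related supplies in the same area")
print(round(p, 3), round(r, 3), round(f, 3))
```

Here the candidate shares four of its five words with the seven-word reference, so precision is 4/5 and recall is 4/7.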

# Let's start with data preprocessing

In [105]:
df_origin.shape

(1000, 3)

### I won't use the 'title' column, so I will drop it

In [106]:
df_origin.drop('title', inplace=True, axis=1)

### Dropping rows where 'text' is NaN

In [107]:
df_origin.dropna(subset = ["text"], inplace=True)

### For now, I will only preprocess the 'text' column and leave 'headline' untouched

### Let's create a function for data preprocessing

In [108]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [109]:
def clean_text(texts):
    corpus = []
    
    for t in texts:
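        # Drop @mentions, URLs, a leading "rt", and every character that is not an
        # ASCII letter, space, or tab (this also removes newlines, digits, punctuation)
        text = re.sub(r"(@[A-Za-z]+)|([^A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", t)
        # Digits are already gone after the substitution above; kept as a safeguard
        text = re.sub(r"\d+", "", text)
        text = text.lower()
        text = text.split()
        text = [w for w in text if w not in stop_words]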
        text = ' '.join(text)
        corpus.append(text)
        
    return corpus

In [110]:
clean_text(df_origin['text'][0:1])

['youre photographer keep necessary lens cords batteries quadrant home studio paints kept brushes cleaner canvas print supplies ink etc make broader groups areas supplies make finding easier limiting search much smaller area ideas includeessential supplies area things use every dayinspiration reference areadedicated work area infrequent secondary supplies area tucked way doesnt mean cleaning entire studio means keeping area immediately around desk easel pottery wheel etc clean night discard trash unnecessary materials wipe dirty surfaces endeavor leave workspace way sit next day start working immediately without work tidyingeven rest studio bit disorganized organized workspace help get business every time want make art visual people lot artist clutter comes desire keep track supplies visually instead tucked sight using jars old glasses vases cheap clear plastic drawers keep things sight without leaving strewn haphazardly ideas beyond mentioned includecanvas shoe racks back doorwine rac
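The output above contains glued tokens such as "includeessential" and "dayinspiration". The pattern deletes newlines outright, so the last word of one line merges with the first word of the next. A possible variant, sketched below, replaces removed characters with a space instead. `clean_one` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stop-word list), chosen so the example runs without downloads. It does not handle mentions or URLs.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "you", "use", "some", "and"}

def clean_one(text, stop_words=STOP_WORDS):
    # Replace every run of non-letter characters with a single space instead of
    # deleting it, so words separated only by a newline stay separate
    text = re.sub(r"[^A-Za-z]+", " ", text)
    words = text.lower().split()
    return " ".join(w for w in words if w not in stop_words)

sample = ("Some ideas include:\n\nEssential supplies area -- the things "
          "you use every day.\nInspiration and reference area.")
print(clean_one(sample))
```

The trade-off is that apostrophes now split contractions ("don't" becomes "don" and "t"), whereas the original function turns them into single tokens such as "dont".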

In [111]:
df = df_origin.copy()

In [112]:
s_t = time.time()
df['text'] = clean_text(df_origin['text'])
e_t = time.time()

In [113]:
print(str(e_t-s_t) + ' s')

1.6777215003967285 s


271.0607466697693 - 252000

132.57278633117676 - 100000

69.89255881309509 - 50000

In [114]:
df['text'].head()

0    youre photographer keep necessary lens cords b...
1    see image drawing develops stepbystep however ...
2    possible become vfx artist without college deg...
3    best art investors research pieces art buy som...
4    start planning project work youll likely gathe...
Name: text, dtype: object

In [115]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Lemmatization with POS tag

In [116]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [117]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [118]:
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [119]:
def word_lemmatizer(texts):
    corpus = []
    lemmatizer = WordNetLemmatizer()
    for text in texts:
        lem_text = [lemmatizer.lemmatize(i, get_wordnet_pos(i)) for i in word_tokenize(text)]
        corpus.append(lem_text)
    return corpus

In [121]:
s_t = time.time()
df['text'] = word_lemmatizer(df["text"])
e_t = time.time()

In [122]:
print(str(e_t-s_t) + ' s')

441.98477840423584 s
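The 442 s above covers fewer than 1000 texts. Because `get_wordnet_pos` depends only on the word it receives, repeated words could reuse an earlier result. The sketch below shows the idea with `functools.lru_cache`. `slow_pos_lookup` is a hypothetical stand-in for the NLTK call so the example runs without corpus downloads, and the speed-up on the real data has not been measured here.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how often the expensive lookup really runs

def slow_pos_lookup(word):
    # Hypothetical stand-in for nltk.pos_tag([word]); not a real tagger
    calls["n"] += 1
    return "v" if word.endswith("ing") else "n"

@lru_cache(maxsize=None)
def cached_pos(word):
    # Same result as slow_pos_lookup, computed once per distinct word
    return slow_pos_lookup(word)

tokens = ["keep", "supplies", "area", "keep", "area", "cleaning", "area"]
tags = [cached_pos(t) for t in tokens]
print(tags, calls["n"])
```

Seven tokens trigger only four real lookups, one per distinct word. In the notebook the same decorator could be tried on `get_wordnet_pos`.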


In [123]:
df['text'].head()

0    [youre, photographer, keep, necessary, lens, c...
1    [see, image, draw, develops, stepbystep, howev...
2    [possible, become, vfx, artist, without, colle...
3    [best, art, investor, research, piece, art, bu...
4    [start, planning, project, work, youll, likely...
Name: text, dtype: object

In [124]:
len(df['text'])

996

In [125]:
from sklearn.model_selection import train_test_split

In [129]:

x_train, x_val = train_test_split(df['text'], test_size = 0.2, random_state = 0, shuffle = True)

In [130]:
len(x_train)

796

In [132]:
len(x_val)

200

In [133]:
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import LSTM, Input, TimeDistributed, Dense, Activation, RepeatVector, Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Using TensorFlow backend.


In [134]:
from keras.preprocessing.text import Tokenizer

In [135]:
def tokenize(sentences):
    # Create tokenizer
    text_tokenizer = Tokenizer()
    # Fit texts
    text_tokenizer.fit_on_texts(sentences)
    return text_tokenizer.texts_to_sequences(sentences), text_tokenizer

In [144]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train)

In [146]:
thresh = 4  # words seen fewer than `thresh` times count as rare

cnt = 0       # number of rare words
tot_cnt = 0   # number of distinct words
freq = 0      # total occurrences of rare words
tot_freq = 0  # total occurrences of all words

for key, value in tokenizer.word_counts.items():
    tot_cnt += 1
    tot_freq += value
    if value < thresh:
        cnt += 1
        freq += value

print("% of rare words in vocabulary:", (cnt/tot_cnt) * 100)
print("Total Coverage of rare words:", (freq/tot_freq) * 100)

% of rare words in vocabulary: 73.98942198715527
Total Coverage of rare words: 6.183418058382013
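The two percentages answer different questions. The first is the share of distinct words that occur fewer than `thresh` times; the second is the share of all word occurrences that those words account for. A toy example on plain Python counters shows the same computation. The `docs` list is made up for illustration.

```python
from collections import Counter

docs = [["keep", "area", "clean"], ["keep", "area", "tidy"],
        ["keep", "area"], ["keep", "studio"]]
thresh = 2

# Count every word occurrence across all documents
counts = Counter(word for doc in docs for word in doc)
rare = {w: c for w, c in counts.items() if c < thresh}

vocab_share = len(rare) / len(counts) * 100              # share of distinct words
coverage = sum(rare.values()) / sum(counts.values()) * 100  # share of occurrences
print(vocab_share, coverage)
```

Three of the five distinct words are rare, but they make up only three of the ten occurrences. This is the same pattern as in the notebook, where about 74% of the vocabulary covers about 6% of the text.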


In [148]:
# Keep only the most frequent words (total vocabulary minus the rare ones)
x_tokenizer = Tokenizer(num_words=tot_cnt - cnt)
x_tokenizer.fit_on_texts(list(x_train))

# Convert token lists into sequences of integer word indices (not one-hot vectors)
x_tr_seq = x_tokenizer.texts_to_sequences(x_train)
x_val_seq = x_tokenizer.texts_to_sequences(x_val)

# Pad with zeros at the end up to length 100; longer sequences are cut from the
# front, because pad_sequences truncates with truncating='pre' by default
x_tr = pad_sequences(x_tr_seq, maxlen=100, padding='post')
x_val = pad_sequences(x_val_seq, maxlen=100, padding='post')

# Vocabulary size for the embedding layers (index 0 is reserved for padding)
x_voc = x_tokenizer.num_words + 1

print("Size of vocabulary in X = {}".format(x_voc))

Size of vocabulary in X = 4132
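With `padding='post'` and the default `truncating='pre'`, `pad_sequences` appends zeros to short sequences but cuts long ones from the front, so only the last `maxlen` tokens of a long article survive. The pure-Python function below mimics that behaviour for illustration. `pad_post_truncate_pre` is a hypothetical helper, not part of Keras.

```python
def pad_post_truncate_pre(seqs, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post', truncating='pre') on lists."""
    out = []
    for seq in seqs:
        kept = seq[-maxlen:]  # keep the LAST maxlen items
        out.append(kept + [value] * (maxlen - len(kept)))  # zeros at the end
    return out

padded = pad_post_truncate_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

Passing `truncating='post'` to `pad_sequences` would keep the first 100 tokens of each text instead.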


In [169]:
latent_dim = 300
embedding_dim = 200

# Encoder
encoder_inputs = Input(shape=(100,))

#embedding layer
enc_emb =  Embedding(x_voc, embedding_dim, trainable = True)(encoder_inputs)

#encoder lstm 1
encoder_lstm1 = LSTM(latent_dim, return_sequences = True, return_state = True, dropout = 0.4, recurrent_dropout = 0.4)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

#encoder lstm 2
encoder_lstm2 = LSTM(latent_dim,return_sequences = True,return_state = True, dropout = 0.4, recurrent_dropout = 0.4)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

#encoder lstm 3
encoder_lstm3 = LSTM(latent_dim, return_state = True, return_sequences = True, dropout = 0.4, recurrent_dropout = 0.4)
encoder_output3, state_h3, state_c3 = encoder_lstm3(encoder_output2)

#encoder lstm 4
#encoder_lstm4 = LSTM(latent_dim, return_state = True, return_sequences = True, dropout = 0.4, recurrent_dropout = 0.4)
#encoder_output4, state_h4, state_c4 = encoder_lstm4(encoder_output3)

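# Use the third LSTM's outputs and final states directly; calling encoder_lstm3
# again here would re-apply the same layer to its own output
encoder_outputs, state_h, state_c = encoder_output3, state_h3, state_c3
#encoder_outputs, state_h, state_c= encoder_lstm2(encoder_output1)

# Set up the decoder, using the encoder's final states (state_h, state_c) as its initial state.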
decoder_inputs = Input(shape = (None,))

#embedding layer
dec_emb_layer = Embedding(x_voc, embedding_dim, trainable = True)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state = True, dropout = 0.4, recurrent_dropout = 0.2)
decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])

#dense layer
decoder_dense =  TimeDistributed(Dense(x_voc, activation='softmax'))
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model 
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 100, 200)     826400      input_5[0][0]                    
__________________________________________________________________________________________________
lstm_8 (LSTM)                   [(None, 100, 300), ( 601200      embedding_4[0][0]                
__________________________________________________________________________________________________
lstm_9 (LSTM)                   [(None, 100, 300), ( 721200      lstm_8[0][0]                     
____________________________________________________________________________________________
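The parameter counts in the summary can be checked by hand. An embedding has vocabulary size times embedding dimension weights, and an LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias. The helper names below are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (units * (input_dim + units) + units)

emb = embedding_params(4132, 200)   # embedding_4
lstm_a = lstm_params(200, 300)      # lstm_8, fed by the 200-d embedding
lstm_b = lstm_params(300, 300)      # lstm_9, fed by the 300-d lstm_8 output
print(emb, lstm_a, lstm_b)
```

These reproduce the 826400, 601200, and 721200 shown in the summary.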

In [170]:
from tensorflow.keras.callbacks import EarlyStopping

In [171]:
model.compile(optimizer = 'rmsprop',
              loss = 'sparse_categorical_crossentropy')

In [176]:
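# Teacher forcing on the text itself: the decoder input is each sequence without
# its last token, and the target is the same sequence shifted left by one step
history = model.fit([x_tr, x_tr[:, :-1]],
                    x_tr.reshape(x_tr.shape[0], x_tr.shape[1], 1)[:, 1:],
                    epochs=50,
                    batch_size=128)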

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50


KeyboardInterrupt: 
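The `fit` call feeds the padded text itself to the decoder. The decoder input is each sequence without its last token, and the target is the same sequence shifted one step to the left, so at every position the model learns to predict the next token of the input text. The 'headline' column is not involved at this stage. A list-based illustration with a made-up sequence:

```python
seq = [12, 7, 45, 3, 0, 0]   # made-up padded token ids

decoder_input = seq[:-1]     # what x_tr[:, :-1] selects for one row
target = seq[1:]             # what the reshaped x_tr[:, 1:] selects for one row

# At each step the decoder sees one token and is asked for the following one
for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(step, seen, "->", expected)
```

Both slices have length 99 for the 100-token rows used in the notebook, which is why the decoder input and the target line up step by step.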

823    [one, best, thing, breakup, order, move, make,...
883    [crush, might, linger, daydream, could, hold, ...
113    [place, front, withers, slide, little, bit, ba...
613    [eye, communicate, lot, thing, without, say, w...
37     [pet, stay, home, keep, thermostat, comfortabl...
                             ...                        
839    [boyfriend, wont, stop, touch, youve, talk, se...
193    [really, like, boy, tell, figure, want, note, ...
633    [take, good, care, body, make, attractive, hel...
563    [dont, wait, around, hop, hear, life, without,...
688    [need, start, confirm, dont, like, friend, try...
Name: text, Length: 796, dtype: object