## Disaster Tweets - Embedding Methods

This notebook applies two word-embedding methods, Word2Vec and GloVe, to the disaster tweets\
and then trains an LSTM on the embedded text.\
Here is an overview:

- Load, clean and preprocess data
- Embedding with Word2Vec and GloVe
- Develop LSTM Models, train and predict

In [8]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [9]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')

In [41]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
import contractions
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from spellchecker import SpellChecker


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lxie1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lxie1\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lxie1\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Text Cleaning
The tweets are cleaned as follows:

- Remove URLs, HTML tags, and non-letter characters
- Lemmatize the text
- Expand contractions
- Remove stop words
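
As a minimal sketch of the regex-based steps only, the snippet below runs them on one made-up tweet. The function name `strip_noise` and the sample text are illustrative assumptions, and lemmatization, contraction expansion, and stop-word removal are left out so that it needs nothing beyond the standard library.

```python
import re

def strip_noise(text):
    """Remove URLs, HTML tags and non-letter characters, then lowercase."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)             # digits, punctuation, symbols
    return ' '.join(text.lower().split())               # lowercase, collapse whitespace

# Hypothetical tweet, not taken from the dataset
sample = "Forest fire near La Ronge!! <b>#wildfire</b> http://t.co/abc123"
print(strip_noise(sample))  # forest fire near la ronge wildfire
```

The URL is removed before the non-letter filter runs. Otherwise the filter would strip the punctuation and leave fragments such as `httptcoabc` in the text.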

In [44]:
stop_words=set(nltk.corpus.stopwords.words('english'))
wnl=WordNetLemmatizer()

def clean_text(doc):
    doc=re.sub(r'https?://\S+|www\.\S+','',doc)  # remove URLs
    doc=re.sub(r'<.*?>','',doc)  # remove HTML tags
    doc=contractions.fix(doc)  # expand contractions before apostrophes are stripped
    # flags must be passed by keyword: the 4th positional argument of re.sub is count
    doc=re.sub(r'[^a-zA-Z\s]','',doc,flags=re.I|re.A)
    doc=' '.join([wnl.lemmatize(word) for word in doc.lower().split()])
    tokens=nltk.word_tokenize(doc)
    filtered=[token for token in tokens if token not in stop_words]
    return ' '.join(filtered)

# assigning whole columns avoids the chained-assignment SettingWithCopyWarning
train['text']=train['text'].apply(clean_text)
test['text']=test['text'].apply(clean_text)


In [47]:
train

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | deed reason earthquake may allah forgive | 1 |
| 1 | 4 | | | forest fire near la ronge sask canada | 1 |
| 2 | 5 | | | resident asked shelter place notified officer ... | 1 |
| 3 | 6 | | | people receive wildfire evacuation order calif... | 1 |
| 4 | 7 | | | got sent photo ruby alaska smoke wildfire pour... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | two giant crane holding bridge collapse nearby... | 1 |
| 7609 | 10870 | | | ariaahrary thetawniest control wild fire calif... | 1 |
| 7610 | 10871 | | | utckm volcano hawaii | 1 |
| 7611 | 10872 | | | police investigating ebike collided car little... | 1 |


- The output above shows the processed tweets.

## Word Embeddings
### Word2Vec

In [24]:
from gensim.models import word2vec



In [25]:
word2vec_model = word2vec.Word2Vec([nltk.word_tokenize(doc) for doc in train.text], # tokenized corpus
                                 vector_size = 15, # embedding dimension
                                 window = 20, # context window size
                                 min_count = 1, # keep words that appear at least once
                                 sg = 1, # 1 for skip-gram, CBOW otherwise
                                 sample = 1e-3, # downsampling threshold for frequent words
                                )
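
To make the `window` and `sg = 1` settings more concrete, here is a simplified sketch of how skip-gram training pairs are formed: every word is paired with each neighbour that falls inside the context window. The helper `skipgram_pairs` is hypothetical and is not part of gensim. Real implementations add details this sketch ignores, such as downsampling frequent words.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

toy_tokens = ["forest", "fire", "near"]
print(skipgram_pairs(toy_tokens, window=1))
# [('forest', 'fire'), ('fire', 'forest'), ('fire', 'near'), ('near', 'fire')]
```

With `window = 20`, as in the cell above, the window is usually longer than a cleaned tweet, so nearly every word in a tweet ends up as context for every other word.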

### GloVe
- I use the 200-dimensional GloVe vectors (`glove.6B.200d.txt`)

In [48]:
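dict1={}  # maps each word to its 200d GloVe vector

# each line of the file holds a word followed by its 200 coefficients
with open('C:\\Users\\lxie1\\Disaster Tweets\\glove.6B.200d.txt',encoding='utf-8') as file:
    for line in file:
        values=line.split()
        word=values[0]
        vectors=np.asarray(values[1:],dtype='float32')
        dict1[word]=vectors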
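
The GloVe text format read above is one word per line, followed by its space-separated coefficients. The sketch below parses a tiny in-memory stand-in with three dimensions instead of 200. The words, numbers, and the names `toy_glove` and `toy_vectors` are made up for illustration.

```python
import io
import numpy as np

# Hypothetical 3-dimensional stand-in for glove.6B.200d.txt
toy_glove = io.StringIO(
    "fire 0.1 0.2 0.3\n"
    "flood -0.4 0.5 0.6\n"
)

toy_vectors = {}
for line in toy_glove:
    word, *coefs = line.split()                       # first field is the word
    toy_vectors[word] = np.asarray(coefs, dtype='float32')

print(toy_vectors['flood'].shape)  # (3,)
```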

A Keras `Tokenizer` converts each tweet into a sequence of integer word indices.\
Each sequence is padded or truncated to a maximum length of 120.\
The result is a 7613 x 120 matrix representing all the training tweets.

In [49]:
tok=tf.keras.preprocessing.text.Tokenizer()

# tokenize each corpus once and reuse the result
train_tokens=[nltk.word_tokenize(doc) for doc in train.text]
test_tokens=[nltk.word_tokenize(doc) for doc in test.text]

tok.fit_on_texts(train_tokens)  # the vocabulary is built from the training tweets only
seq_train=tok.texts_to_sequences(train_tokens)
seq_test=tok.texts_to_sequences(test_tokens)
pad_train=tf.keras.preprocessing.sequence.pad_sequences(seq_train,maxlen= 120,padding='post',truncating='post')
pad_test=tf.keras.preprocessing.sequence.pad_sequences(seq_test,maxlen= 120,padding='post',truncating='post')
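
The effect of `padding='post'` and `truncating='post'` can be sketched in plain Python. The helper `pad_post` is a hypothetical illustration of the behaviour for a single sequence, not the Keras implementation.

```python
def pad_post(seq, maxlen, value=0):
    """Cut a sequence to maxlen from the end, then right-pad it with `value`."""
    seq = list(seq)[:maxlen]                     # truncating='post' keeps the first maxlen ids
    return seq + [value] * (maxlen - len(seq))   # padding='post' appends the padding value

print(pad_post([5, 3], maxlen=4))           # [5, 3, 0, 0]
print(pad_post([1, 2, 3, 4, 5], maxlen=4))  # [1, 2, 3, 4]
```

Index 0 never refers to a real word because the tokenizer starts numbering at 1. That is why the embedding layer later can treat 0 as padding via `mask_zero=True`.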

Create an embedding matrix from the 200d GloVe vectors.\
This gives a matrix representation of every word in the corpus vocabulary.\
Row `i` holds the 200d GloVe vector of the word with tokenizer index `i`; row 0 (padding) and words missing from GloVe stay all zeros.

In [50]:
emb_matrix=np.zeros((len(tok.word_index)+1,200))  # row 0 is reserved for padding
for word,i in tok.word_index.items():
    vector=dict1.get(word)
    if vector is not None:  # words not in GloVe keep a zero vector
        emb_matrix[i]=vector
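
Because words missing from GloVe are left as zero rows, it can be useful to check how much of the vocabulary is actually covered. The following is a small sketch with a made-up three-word vocabulary and three-dimensional vectors. All `toy_*` names and values are assumptions for illustration.

```python
import numpy as np

# Hypothetical tokenizer output and pretrained vectors
toy_word_index = {'fire': 1, 'flood': 2, 'utckm': 3}
toy_vectors = {'fire': np.array([0.1, 0.2, 0.3]),
               'flood': np.array([-0.4, 0.5, 0.6])}

toy_matrix = np.zeros((len(toy_word_index) + 1, 3))  # row 0 stays zero for padding
for word, idx in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[idx] = vec                        # 'utckm' has no vector, so row 3 stays zero

covered = sum(word in toy_vectors for word in toy_word_index)
coverage = covered / len(toy_word_index)
print(f"{coverage:.2f}")  # 0.67
```

The same ratio computed on `tok.word_index` and `dict1` would show how many tweet tokens, such as misspellings or user handles, receive no pretrained vector.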

In [51]:
emb_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.12039   ,  0.15834001,  0.30823001, ..., -0.13108   ,
         0.36555001,  0.55979002],
       [ 0.41655001,  0.40977001, -0.099598  , ...,  0.15505999,
        -0.98438001,  0.23274   ],
       ...,
       [-0.54595   ,  0.08216   , -0.12738   , ...,  0.11586   ,
        -0.30774   , -0.82823998],
       [-0.65634   ,  0.56822002, -0.17518   , ..., -0.14872999,
        -0.22652   , -0.080195  ],
       [-0.30206001, -0.07879   , -0.059084  , ..., -0.096268  ,
        -0.33689001, -0.24123   ]])

## Model Building

In [55]:
model=tf.keras.Sequential([
    tf.keras.layers.Embedding(len(tok.word_index)+1,200,weights=[emb_matrix],input_length=120,mask_zero=True,trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, dropout=0.2,recurrent_dropout=0.2,return_sequences=True)),
    tf.keras.layers.GlobalMaxPooling1D(), 
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64,activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32,activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1,activation='sigmoid')
])

In [100]:
def model_builder(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(len(tok.word_index)+1,200,weights=[emb_matrix],input_length=120,mask_zero=True,trainable=False))
    model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, dropout=0.2,recurrent_dropout=0.2,return_sequences=True)))
    model.add(tf.keras.layers.GlobalMaxPooling1D())
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Dropout(0.2))
    
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    model.add(tf.keras.layers.Dense(units = hp_units,activation='relu'))
    
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss=tf.keras.losses.BinaryCrossentropy(),
                metrics=['accuracy'])
    return model
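
For a sense of scale, the search space defined by `hp.Int` and `hp.Choice` above can be enumerated directly. This sketch assumes that `max_value` is inclusive. Hyperband samples from this space rather than trying every combination.

```python
from itertools import product

units_choices = list(range(32, 512 + 1, 32))  # 32, 64, ..., 512
lr_choices = [1e-2, 1e-3, 1e-4]

grid = list(product(units_choices, lr_choices))
print(len(units_choices), len(grid))  # 16 48
```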

In [101]:
tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3,
                     directory='C:\\Users\\lxie1\\Disaster Tweets',
                     project_name = 'kt_disaster')

INFO:tensorflow:Reloading Oracle from existing project C:\Users\lxie1\Disaster Tweets\kt_disaster\oracle.json
INFO:tensorflow:Reloading Tuner from C:\Users\lxie1\Disaster Tweets\kt_disaster\tuner0.json


In [102]:
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

In [103]:
tuner.search(pad_train, train.target, epochs=20, validation_split=0.2, callbacks=[stop_early])


Trial 21 Complete [00h 16m 39s]
val_accuracy: 0.8214051127433777

Best val_accuracy So Far: 0.8214051127433777
Total elapsed time: 03h 24m 48s

Search: Running Trial #22

Hyperparameter    |Value             |Best Value So Far 
units             |128               |224               
learning_rate     |0.0001            |0.01              
tuner/epochs      |4                 |4                 
tuner/initial_e...|0                 |0                 
tuner/bracket     |1                 |1                 
tuner/round       |0                 |0                 

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4

KeyboardInterrupt: 

In [90]:
pad_train

array([[3911,  446,  152, ...,    0,    0,    0],
       [ 105,    2,  144, ...,    0,    0,    0],
       [1506, 1378, 1844, ...,    0,    0,    0],
       ...,
       [3652,  442, 1366, ...,    0,    0,    0],
       [  20,  978, 2772, ...,    0,    0,    0],
       [ 128,   22,  426, ...,    0,    0,    0]])

In [77]:
model.summary()  # summary() prints the table itself; wrapping it in print() also prints None
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), # binary classification
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), # must be a single value, not a list; 1e-3 is the Keras default
              metrics=['accuracy'])

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 120, 200)          3077400   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 120, 200)          240800    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 200)               0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 200)               800       
_________________________________________________________________
dropout_6 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 64)                12864     
_________________________________________________________________
dropout_7 (Dropout)          (None, 64)               
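
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for each layer type. The value `vocab_rows = 15387` is not stated anywhere in the notebook; it is inferred from the embedding parameter count divided by 200.

```python
emb_dim, lstm_units = 200, 100
vocab_rows = 15387  # inferred: 3077400 / 200, i.e. len(tok.word_index) + 1

embedding_params = vocab_rows * emb_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * ((emb_dim + lstm_units + 1) * lstm_units)
bilstm_params = 2 * lstm_one_direction        # forward + backward

bilstm_out = 2 * lstm_units                   # concatenated outputs: 200 features
batchnorm_params = 4 * bilstm_out             # gamma, beta, moving mean, moving variance
dense64_params = bilstm_out * 64 + 64         # weights + biases

print(embedding_params, bilstm_params, batchnorm_params, dense64_params)
# 3077400 240800 800 12864
```

Since the embedding layer is created with `trainable=False`, its 3,077,400 weights are frozen and only the remaining layers are updated during training.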

In [84]:
from kerastuner.tuners import Hyperband

In [83]:
import kerastuner as kt

In [60]:
checkpoint=tf.keras.callbacks.ModelCheckpoint('model.h5',monitor='val_loss',save_best_only=True)
reduce_lr=tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',factor=0.2,patience=2,min_lr=1e-5)
es=tf.keras.callbacks.EarlyStopping(monitor='val_loss',patience=3,restore_best_weights=True)

history=model.fit(pad_train,train.target,batch_size=100,epochs=15,validation_split=0.2,callbacks=[checkpoint,es,reduce_lr])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15


Training stopped early at epoch 12.

- The model reaches an accuracy of 0.8374, which looks good
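
The stopping rule behind `EarlyStopping(monitor='val_loss', patience=3)` can be sketched as a small loop: training ends once the validation loss has failed to improve for `patience` consecutive epochs. The function `early_stop_epoch` and the loss values are hypothetical. The sketch ignores options such as `min_delta` and does not model `restore_best_weights`.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 1-indexed; stopped_epoch is None if never triggered."""
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:            # improvement: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:   # no improvement for `patience` epochs in a row
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses, not taken from the training run above
toy_losses = [0.50, 0.45, 0.46, 0.47, 0.48, 0.40]
print(early_stop_epoch(toy_losses, patience=3))  # (5, 2)
```

In this toy run the sixth epoch would have been the best one, but it is never reached. That is the trade-off of a small `patience`.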

## Conclusion and submission CSV

- Predict the test-set labels and write them to a CSV file

In [61]:
pred=model.predict(pad_test) 

In [62]:
pd.DataFrame({
    'id':test.id,
    'target':np.where(pred>0.50,1,0)[:,0] 
}).to_csv('submission_GloVe.csv',index=False)
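
The thresholding step inside the cell above turns the sigmoid outputs, which have shape `(n, 1)`, into a flat vector of 0/1 labels. This is a minimal sketch on made-up probabilities. `toy_pred` is an assumption, not model output.

```python
import numpy as np

# Hypothetical sigmoid outputs with the same (n, 1) shape as model.predict returns here
toy_pred = np.array([[0.91], [0.50], [0.12]])

toy_labels = np.where(toy_pred > 0.50, 1, 0)[:, 0]  # threshold, then take the single column
print(toy_labels.tolist())  # [1, 0, 0]
```

Because the comparison is strict, a probability of exactly 0.50 is labelled 0.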