# NLP Twitter Disaster

In this notebook, we practice NLP modeling by building a classification model that predicts whether a tweet concerns a disaster. The workflow is as follows:

1. Tokenize the tweets and convert into sequences.
2. Apply word embedding to convert the sequences into tensors.
3. Build a CNN model to classify the tweets.

## Tokenizing the text

In [None]:
import nltk
from nltk.tokenize import TweetTokenizer, NLTKWordTokenizer
from nltk.stem import WordNetLemmatizer
import numpy as np
import pandas as pd
import os
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
import gensim.downloader as api
import tensorflow_addons as tfa

In [None]:
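# WordNetLemmatizer needs the WordNet corpus; omw-1.4 is the Open Multilingual Wordnet data.
nltk.download('wordnet')
nltk.download('omw-1.4')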

In [None]:
DAT_DIR = "../input/nlp-getting-started"
train = pd.read_csv(os.path.join(DAT_DIR, "train.csv"))

In [None]:
train

Let's quickly review a sample of the keywords and locations. It seems that the keyword, when present, usually also appears in the tweet text. In addition, without context, the location may not be a good indicator of a disaster. Therefore, we use only the text field in this study.

In [None]:
train[~train.keyword.isna()]

In [None]:
train.target.value_counts()

We first tokenize the tweets with NLTK's TweetTokenizer, then lemmatize each token and convert it to lower case, and finally join the tokens back into strings. We then use Keras's Tokenizer to transform those strings into integer sequences.

In [None]:
tk = TweetTokenizer()
# tk = NLTKWordTokenizer()
texts = train.text.apply(tk.tokenize)

In [None]:
lemmer = WordNetLemmatizer()
texts2 = [[lemmer.lemmatize(w).lower() for w in sentence] for sentence in texts]
text_jnt = [' '.join(txt) for txt in texts2]

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_jnt)
text_sequences = tokenizer.texts_to_sequences(text_jnt)
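
To make the word-to-integer step concrete, here is a minimal pure-Python sketch of the idea behind `fit_on_texts` and `texts_to_sequences`. It is a simplification, not the Keras implementation: it only lower-cases and splits on whitespace, and the names `build_word_index`, `texts_to_sequences`, and the `toy_*` variables are hypothetical. It assumes the Keras defaults of ranking words by frequency starting at index 1 and dropping words that were never seen during fitting.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_texts = ["fire in the forest", "the forest is calm", "fire fire everywhere"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index)
print(toy_index)      # 'fire' is the most frequent word, so it gets index 1
print(toy_sequences)  # [[1, 4, 2, 3], [2, 3, 5, 6], [1, 1, 7]]
```

Index 0 is never assigned to a word, which is why it can serve as the padding value later on.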

Let's pad the sequences.

In [None]:
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences)
print(f'text_sequences.shape = {text_sequences.shape}')
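
As an illustration of what padding does, the sketch below mimics the default behavior of `pad_sequences` as we understand it: zeros are added on the left, and sequences longer than `maxlen` lose their leading tokens. The helper `pad_pre` and the toy data are hypothetical and only meant to show the mechanics.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        # Without an explicit width, use the longest sequence in this batch.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:]) if maxlen > 0 else []  # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy_batch = [[1, 4, 2, 3], [1, 1, 7], [5]]
print(pad_pre(toy_batch))            # [[1, 4, 2, 3], [0, 1, 1, 7], [0, 0, 0, 5]]
print(pad_pre(toy_batch, maxlen=2))  # [[2, 3], [1, 7], [0, 5]]
```

Because the default width depends on the longest sequence in the data being padded, passing an explicit `maxlen` keeps separately padded data sets at the same width.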

In [None]:
num_records = len(text_sequences)
max_seqlen = len(text_sequences[0])
print(f'num_records = {num_records}, max_seqlen= {max_seqlen}')

In [None]:
# Labels
NUM_CLASSES = 2
cat_labels = tf.keras.utils.to_categorical(train.target,
                                          num_classes=NUM_CLASSES)
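
For reference, the one-hot encoding that `to_categorical` produces can be sketched in plain Python. The helper `one_hot` and the toy targets are hypothetical; the real function returns a NumPy array rather than nested lists.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label's position."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

toy_targets = [0, 1, 1, 0]
toy_one_hot = one_hot(toy_targets, num_classes=2)
print(toy_one_hot)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```

This two-column layout matches the two-unit softmax output layer and the `categorical_crossentropy` loss used later in the notebook.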

In [None]:
word2idx = tokenizer.word_index
idx2word = {v:k for k, v in word2idx.items()}
word2idx['PAD'] = 0
idx2word[0] = 'PAD'
vocab_size = len(word2idx)
print(f'vocab_size = {vocab_size}')

For the word embedding, let's review the Twitter embeddings available through gensim's downloader, as listed below.

In [None]:
[c for c in api.info()['models'].keys() if 'twitter' in c]

For simplicity, let's use the smallest embedding, i.e., 'glove-twitter-25'.

In [None]:
EMBEDDING_MODEL = 'glove-twitter-25'
EMBEDDING_DIM = 25
#EMBEDDING_MODEL = 'glove-twitter-50'
#EMBEDDING_DIM = 50
DATA_DIR = '.'
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, 'E.npy')

In [None]:
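def build_embedding_matrix(sequences, word2idx, embedding_dim, embedding_file):
    """Return a (vocab_size, embedding_dim) matrix whose row i holds the vector of word i."""
    # `sequences` is unused; it is kept so that existing calls keep working.
    if os.path.exists(embedding_file):
        # Reuse the matrix cached by a previous run.
        E = np.load(embedding_file)
    else:
        vocab_size = len(word2idx)
        E = np.zeros((vocab_size, embedding_dim))
        word_vectors = api.load(EMBEDDING_MODEL)
        for word, idx in word2idx.items():
            try:
                # Plain indexing works in gensim 3 and 4; word_vec() is deprecated in gensim 4.
                E[idx] = word_vectors[word]
            except KeyError:
                # Words missing from the pretrained vocabulary keep an all-zero row.
                pass

        # Cache under the path that was passed in, not the global constant.
        np.save(embedding_file, E)

    return E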

In [None]:
E = build_embedding_matrix(text_sequences, word2idx, EMBEDDING_DIM, EMBEDDING_NUMPY_FILE)
print(f'Embedding matrix: {E.shape}')
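
Since words that are missing from the pretrained vectors silently keep an all-zero row, it can be useful to check how much of the vocabulary is actually covered. The sketch below shows the idea on made-up data; `build_toy_embedding`, the toy vocabulary, and the vector values are all hypothetical and do not come from GloVe.

```python
def build_toy_embedding(word_to_index, vectors, dim):
    """Fill row idx with the vector of its word; collect the words without a vector."""
    matrix = [[0.0] * dim for _ in range(len(word_to_index))]
    missing = []
    for word, idx in word_to_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
        else:
            missing.append(word)  # this row stays all zeros
    return matrix, missing

toy_word2idx = {"PAD": 0, "fire": 1, "forest": 2, "qwertyzz": 3}
toy_vectors = {"fire": [0.1, 0.2, 0.3], "forest": [0.4, 0.5, 0.6]}  # made-up values
toy_E, toy_missing = build_toy_embedding(toy_word2idx, toy_vectors, dim=3)
toy_coverage = 1 - len(toy_missing) / len(toy_word2idx)
print(toy_missing)   # ['PAD', 'qwertyzz']
print(toy_coverage)  # 0.5
```

A low coverage on the real vocabulary would suggest that many tokens reach the model as zero vectors, which may be worth knowing before judging the classifier.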

## Build CNN

In this section, we will build a simple CNN to fit the data.

In [None]:
conv_num_filters = 514
conv_kernel_size = 3

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, EMBEDDING_DIM, input_length=max_seqlen,
                             weights = [E], trainable=False),
    
    # block 1
    tf.keras.layers.Conv1D(filters = conv_num_filters,
                          kernel_size=conv_kernel_size,
                          activation='relu'),    
    tf.keras.layers.SpatialDropout1D(0.2),
    tf.keras.layers.GlobalMaxPool1D(),
    
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
])
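
To see what the convolution block computes, here is a small worked example in plain Python for a single filter. It assumes the `Conv1D` defaults of stride 1 and no padding, so an input with `n` steps and a kernel of size `k` yields `n - k + 1` outputs, and it leaves out dropout, which is inactive at inference time. The function `conv1d_valid` and the toy numbers are hypothetical.

```python
def conv1d_valid(inputs, kernel, bias=0.0):
    """Slide one filter over a (steps, channels) input with stride 1, then apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(inputs) - k + 1):
        total = bias
        for offset in range(k):
            # Dot product between one input step and the matching kernel row.
            for x, w in zip(inputs[start + offset], kernel[offset]):
                total += x * w
        outputs.append(max(0.0, total))
    return outputs

# 5 steps with 2 channels each, standing in for 5 embedded tokens.
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
toy_kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # kernel_size = 3
toy_feature_map = conv1d_valid(toy_inputs, toy_kernel)
toy_pooled = max(toy_feature_map)  # global max pooling for this one filter
print(toy_feature_map)  # [0.0, 1.0, 1.0]
print(toy_pooled)       # 1.0
```

In the model above, each filter produces one such pooled value, so the dense layers receive a vector with one entry per filter regardless of the sequence length.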

In [None]:
model.compile(optimizer='Adam',
             loss = 'categorical_crossentropy',
             metrics = [tfa.metrics.F1Score(num_classes=NUM_CLASSES)])
model.summary()
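
As a reminder of what the F1 metric measures, the sketch below computes a per-class F1 score from hard labels in plain Python. It is an illustration, not the `tfa` implementation: the helper `per_class_f1` and the toy labels are hypothetical. With `average` left at its default, we understand `tfa.metrics.F1Score` to report one score per class in the same spirit.

```python
def per_class_f1(y_true, y_pred, num_classes):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_f1 = per_class_f1(toy_true, toy_pred, num_classes=2)
print(toy_f1)  # both classes have TP=2, FP=1, FN=1, so each score is 2/3
```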

In [None]:
NUM_EPOCHS = 10
VERBOSE = 1
SEED = 1234
BATCH_SIZE = 128
VALIDATION_SPLIT = 0.2

In [None]:
tf.random.set_seed(SEED)
model.fit(text_sequences,
          cat_labels,
          epochs=NUM_EPOCHS,
          batch_size = BATCH_SIZE,
          verbose = VERBOSE,
          validation_split = VALIDATION_SPLIT)
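
According to the Keras documentation, `validation_split` holds out the last fraction of the samples, taken before any shuffling. The sketch below shows that index arithmetic on a hypothetical data set of 10 rows; `split_tail` is an illustrative helper, not a Keras function.

```python
import math

def split_tail(num_samples, validation_split):
    """Return index ranges for training (head) and validation (tail)."""
    split_at = int(math.floor(num_samples * (1.0 - validation_split)))
    return range(0, split_at), range(split_at, num_samples)

toy_train_idx, toy_val_idx = split_tail(num_samples=10, validation_split=0.2)
print(list(toy_train_idx))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(list(toy_val_idx))    # [8, 9]
```

If the rows of the training file happen to be ordered in some way, the tail may not be representative, so shuffling the data before calling `fit` could be worth considering.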

The validation results seem good, so we proceed to prediction.

## Prediction on the testing data set

In [None]:
test = pd.read_csv(os.path.join(DAT_DIR, "test.csv"))
test_texts = test.text.apply(tk.tokenize)

In [None]:
test_texts2 = [[lemmer.lemmatize(w).lower() for w in sentence] for sentence in test_texts]
test_text_jnt = [' '.join(txt) for txt in test_texts2]
test_text_sequences = tokenizer.texts_to_sequences(test_text_jnt)

In [None]:
# Pad to the training width so the input matches what the model was built for.
test_text_sequences = tf.keras.preprocessing.sequence.pad_sequences(test_text_sequences, maxlen=max_seqlen)
print(f'test_text_sequences.shape = {test_text_sequences.shape}')

In [None]:
test_cat_labels = model.predict(test_text_sequences)
print(test_cat_labels[:5])

In [None]:
test_labels = np.argmax(test_cat_labels, axis=1)
test['predict'] = test_labels

Let's review a few positive and negative predictions as a sanity check. Overall, the predictions seem to make sense.

In [None]:
test[test.predict == 1]

In [None]:
test[test.predict == 0]

## Submit Predictions

In [None]:
sample_submission = pd.read_csv(os.path.join(DAT_DIR, 'sample_submission.csv'))

In [None]:
sample_submission['target'] = test['predict']
sample_submission

In [None]:
sample_submission.to_csv('submission.csv', index=False)