## NLP Tutorial

NLP - or *Natural Language Processing* - is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!

In [None]:
#! pip install tensorflow --upgrade
import re

import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import keras_tuner as kt
import nltk
import tensorflow as tf
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam

from keras.callbacks import EarlyStopping
from keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding
from keras.losses import BinaryCrossentropy, MeanAbsoluteError
from keras.models import Sequential

# Resources used by stopwords, word_tokenize and WordNetLemmatizer.
# (Newer NLTK releases may also ask for "punkt_tab".)
for resource in ("stopwords", "punkt", "wordnet", "omw-1.4"):
    nltk.download(resource)
print(tf.__version__)

In [None]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

### A quick look at our data

Let's look at our data: first its size and class balance, then an example of what is NOT a disaster tweet.

In [None]:
print(len(train_df))

In [None]:
train_df.target.value_counts()

In [None]:
train_df[train_df["target"] == 0]["text"].values[1]

And one that is:

In [None]:
train_df[train_df["target"] == 1]["text"].values[1]

In [None]:
stop_words=stopwords.words('english')
print(stop_words)

In [None]:
def clean_txt(text):
    text=text.lower()
    text=re.sub("[^A-Za-z0-9]"," ",text)
    return text.strip()
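
To see what this cleaning step does, here is a small standalone sketch that runs the same function on a made-up tweet (the sample string is hypothetical, not taken from the dataset). Note that punctuation is replaced by spaces rather than removed, so runs of spaces can remain until the text is tokenized.

```python
import re


def clean_txt(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    text = text.lower()
    text = re.sub("[^A-Za-z0-9]", " ", text)
    return text.strip()


# Hypothetical tweet, used only for illustration.
sample = "Forest FIRE near La Ronge, Sask. #Canada!!"
cleaned = clean_txt(sample)

# repr() makes the leftover double and triple spaces visible.
print(repr(cleaned))

# Splitting on whitespace collapses those runs of spaces.
print(cleaned.split())
# → ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'canada']
```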

In [None]:
def make_txt(text):
    """Clean a tweet, drop English stopwords and lemmatize the remaining tokens."""
    txt=clean_txt(text)
    tokens=word_tokenize(txt)
    lemmatizer = WordNetLemmatizer()

    # Remove stopwords, then lemmatize what is left
    filters=[lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return " ".join(filters)

In [None]:
train_df["clean_text"]=train_df.text.apply(make_txt)
test_df["clean_text"]=test_df.text.apply(make_txt)

### Building vectors

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether they're about a real disaster or not (this is not entirely correct, but it's a great place to start).

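We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data our machine learning model can process. For the neural network later in this notebook, we'll also use the Keras `Tokenizer` to turn each tweet into a padded sequence of word indices.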

Note: a `vector` is, in this context, a set of numbers that a machine learning model can work with. We'll look at one in just a second.
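
To make the idea of a count vector concrete, here is a minimal pure-Python sketch of word counting. It is a simplification for illustration, not scikit-learn's implementation: the real `CountVectorizer` also lowercases text and, by default, ignores single-character tokens. The mini-corpus and the helper `to_count_vector` are made up for this example.

```python
from collections import Counter

# Hypothetical mini-corpus, used only for illustration.
docs = ["fire in the forest", "forest fire fire", "nice sunny day"]

# "Fit": collect every distinct token and fix a column order for them.
vocabulary = sorted({token for doc in docs for token in doc.split()})


def to_count_vector(doc):
    """Return one count per vocabulary token; unknown tokens are ignored."""
    counts = Counter(doc.split())
    return [counts.get(token, 0) for token in vocabulary]


# "Transform": every document becomes a vector of the same length.
vectors = [to_count_vector(doc) for doc in docs]

print(vocabulary)
# → ['day', 'fire', 'forest', 'in', 'nice', 'sunny', 'the']
print(vectors[1])
# → [0, 2, 1, 0, 0, 0, 0]

# A new document only contributes tokens that were seen during "fit".
print(to_count_vector("flood in city"))
# → [0, 0, 0, 1, 0, 0, 0]
```

The last line shows why the vocabulary is built from the training data only: unseen words such as "flood" simply have no column.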

In [None]:
tokenizer=Tokenizer(num_words=500)
tokenizer.fit_on_texts(train_df.clean_text)

In [None]:
# Get our training data word index
vocab=len(tokenizer.word_index)
vocab

In [None]:
df_sequences=tokenizer.texts_to_sequences(train_df.clean_text)
df_test_seq=tokenizer.texts_to_sequences(test_df.clean_text)
df_sequences[:5]

In [None]:
max_len=max([len(i) for i in df_sequences])
max_len

In [None]:
df_padded=pad_sequences(df_sequences,maxlen=max_len)
df_test_padded=pad_sequences(df_test_seq,maxlen=max_len)
df_padded
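
The three cells above index the words, convert each tweet to a list of indices, and pad the lists to a common length. The following pure-Python sketch mimics that behaviour on a toy corpus. It is an approximation of what the Keras `Tokenizer` and `pad_sequences` do with their default settings (most frequent word gets index 1, only indices below `num_words` are kept, zeros are added at the front), not their actual code; the texts and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned tweets, used only for illustration.
texts = ["fire forest fire", "flood city", "fire flood rescue team"]
num_words = 4  # keep only word indices 1..3, i.e. the 3 most frequent words

# Rank words by frequency; index 0 is reserved for padding.
freq = Counter(word for text in texts for word in text.split())
word_index = {word: i for i, (word, _) in enumerate(freq.most_common(), start=1)}


def to_sequence(text):
    """Map words to indices, dropping words ranked at or beyond num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]


def pad_pre(seq, maxlen):
    """Left-pad with zeros (and cut from the front if the sequence is too long)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


sequences = [to_sequence(text) for text in texts]
max_len = max(len(seq) for seq in sequences)
padded = [pad_pre(seq, max_len) for seq in sequences]

print(sequences)  # → [[1, 3, 1], [2], [1, 2]]
print(padded)     # → [[1, 3, 1], [0, 0, 2], [0, 1, 2]]
```

Rare words ("city", "rescue", "team") disappear from the sequences, which is the effect of `num_words=500` in the notebook: only the most frequent words are kept.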

In [None]:
X=np.array(df_padded)
y=train_df.target.values

In [None]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors_dirty = count_vectorizer.fit_transform(train_df["text"][0:5])
example_train_vectors_clean = count_vectorizer.fit_transform(train_df["clean_text"][0:5])

In [None]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print("DIRTY")
print(example_train_vectors_dirty[0].todense().shape)
print(example_train_vectors_dirty[0].todense())
print("CLEAN")
print(example_train_vectors_clean[0].todense().shape)
print(example_train_vectors_clean[0].todense())

The above tells us that:
1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that DO exist in the first tweet.

Now let's create vectors for all of our tweets.

In [None]:
train_vectors = count_vectorizer.fit_transform(train_df["clean_text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["clean_text"])

### Our model

As we mentioned above, we think the words contained in each tweet are a good indicator of whether they're about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

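What we're assuming here is a _linear_ connection. The model below goes one step further: it uses a bidirectional LSTM over the padded word sequences, so it can also take word order into account. Let's build it and see!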
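
As a reference point, a linear connection between words and the label can be sketched as a weighted sum: each word adds or subtracts a fixed amount, and the tweet is flagged when the total is positive. The weights and bias below are invented for illustration; a real linear model would learn them from the training data.

```python
# Hypothetical per-word weights: positive values push towards "disaster".
weights = {"fire": 1.5, "flood": 1.2, "evacuate": 0.9, "lol": -1.0, "love": -0.8}
bias = -0.5


def linear_score(text):
    """Bias plus the sum of the weights of all words in the text."""
    return bias + sum(weights.get(token, 0.0) for token in text.split())


def predict(text):
    """Return 1 (disaster) if the score is positive, otherwise 0."""
    return 1 if linear_score(text) > 0 else 0


print(predict("forest fire evacuate now"))  # → 1
print(predict("i love this song lol"))      # → 0
```

Because every word contributes independently, such a model cannot tell "my mixtape is fire" from a report about an actual fire, which is one motivation for sequence models.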

In [None]:
# (number of tweets, vocabulary size) of the count matrix;
# .shape avoids building a dense copy just to measure it
print(train_vectors.shape)

In [None]:
def build_model_1(hp):
    model = tf.keras.Sequential([
        # X holds Tokenizer indices, which are always below tokenizer.num_words,
        # and every row of X has length max_len
        tf.keras.layers.Embedding(input_dim=tokenizer.num_words, output_dim=64, input_length=max_len),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hp.Int('input_unit',min_value=32,max_value=256,step=32))),
        tf.keras.layers.Dropout(hp.Float('Dropout_rate',min_value=0.0,max_value=0.5,step=0.1)),
        # Binary cross-entropy expects a probability in [0, 1], so the output is fixed to a sigmoid
        tf.keras.layers.Dense(1,activation=hp.Fixed('dense_activation','sigmoid'))
    ])

    model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                  # Sample the learning rate on a log scale; a rate of 0 would never update the weights
                  optimizer=tf.keras.optimizers.Adam(hp.Float('learning_rate',min_value=1e-4,max_value=1e-2,sampling='log')),
                  metrics=['accuracy'])
    model.summary()
    return model

In [None]:
# def build_model(hp):
#     model = Sequential()
#     model.add(Embedding(input_dim=len(np.array(train_vectors.todense()[0])[0]), output_dim=32)),
#     model.add(LSTM(hp.Int('input_unit',min_value=32,max_value=256,step=32),return_sequences=False))
#   #  model.add(LSTM(hp.Int('layer_2_neurons',min_value=32,max_value=256,step=32)))
#     model.add(Dropout(hp.Float('Dropout_rate',min_value=0,max_value=0.5,step=0.1)))
#     model.add(Dense(1, activation=hp.Choice('dense_activation',values=['relu', 'sigmoid'],default='relu')))
#     model.compile(loss=MeanAbsoluteError(),
#                   optimizer=Adam(1e-4),
#                   metrics=['accuracy'])
#     model.summary()
#     return model

In [None]:
epochs_standard = 100
tuner = kt.Hyperband(
    hypermodel=build_model_1,
    objective=kt.Objective(name="val_accuracy",direction="max"),
    max_epochs=epochs_standard,
    factor=3,
    hyperband_iterations=1,
    overwrite=True
)
stop_early = EarlyStopping(monitor='val_accuracy', patience=5)

tuner.search(X,y, epochs=epochs_standard, validation_split=0.2, callbacks=[stop_early])

In [None]:
# Get the optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials=3)[0]

In [None]:
print(f"""
The hyperparameter search is complete. The optimal dropout rate
is {best_hps.get('Dropout_rate')}. The optimal dense_activation is {best_hps.get('dense_activation')}.
""")

In [None]:
# Build the model with the optimal hyperparameters and train it on the data for x epochs
model = tuner.hypermodel.build(best_hps)
history = model.fit(X,y, epochs=epochs_standard, validation_split=0.2)

In [None]:
val_acc_per_epoch = history.history['val_accuracy']
best_epoch = val_acc_per_epoch.index(max(val_acc_per_epoch)) + 1
print('Best epoch: %d' % (best_epoch,))

In [None]:
hypermodel = tuner.hypermodel.build(best_hps)

In [None]:
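# Retrain the model for the best number of epochs, using the same
# validation split as the run that determined best_epoch
hypermodel.fit(X,y,epochs=best_epoch, validation_split=0.2)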

In [None]:
def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

In [None]:
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')

Let's test our model and see how well it does on the training data. For this we'll use `cross-validation`: we train on a portion of the known data, then validate on the rest. If we do this several times (with different portions), we get a good idea of how a particular model or method performs.

The metric for this competition is F1, so let's use that here.
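
As a rough illustration of cross-validation scored with F1, here is a self-contained sketch. Everything in it is an assumption made for the example: the labelled tweets are invented, and the "model" is a toy keyword rule (`fit_keyword_model`), not the model used in this notebook. In practice a library helper such as scikit-learn's `cross_val_score` with `scoring="f1"` does the same job.

```python
from collections import Counter

# Hypothetical labelled tweets (1 = disaster), used only for illustration.
data = [
    ("forest fire spreading fast", 1), ("love this sunny day", 0),
    ("flood warning for the city", 1), ("my mixtape is fire", 0),
    ("earthquake damage downtown", 1), ("great day at the beach", 0),
    ("fire crews evacuate homes", 1), ("love the new song", 0),
    ("flood waters rising", 1),
]


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx); each sample is validated exactly once."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def fit_keyword_model(rows):
    """Toy model: words seen in more disaster tweets than non-disaster tweets."""
    pos, neg = Counter(), Counter()
    for text, label in rows:
        (pos if label == 1 else neg).update(set(text.split()))
    return {w for w in pos if pos[w] > neg[w]}


def predict(keywords, text):
    """Predict 1 if the tweet contains any learned keyword."""
    return 1 if keywords & set(text.split()) else 0


scores = []
for train_idx, val_idx in k_fold_indices(len(data), k=3):
    keywords = fit_keyword_model([data[i] for i in train_idx])
    y_true = [data[i][1] for i in val_idx]
    y_pred = [predict(keywords, data[i][0]) for i in val_idx]
    scores.append(f1_score(y_true, y_pred))

print(scores)
```

The model is refit on each training portion and scored only on tweets it has not seen, which is what makes the scores a fair estimate.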

The above scores aren't terrible! It looks like our assumption will score roughly 0.65 on the leaderboard. There are lots of ways to potentially improve on this (TF-IDF, LSA, LSTMs / RNNs, the list is long!), so give any of them a shot!

In the meantime, let's do predictions on our test set and build a submission for the competition.

In [None]:
# Use the model that was retrained for the best number of epochs
predict = hypermodel.predict(df_test_padded)
predict

In [None]:
# Threshold the predicted probabilities into integer class labels (0 or 1)
predict = (predict > 0.5).astype(int)
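
The step above turns probabilities into class labels. The following standalone sketch shows the same thresholding and the resulting file layout, assuming the submission consists of just an `id` and a `target` column as in `sample_submission.csv`. The ids and probabilities are made up.

```python
import csv
import io

# Hypothetical test ids and model outputs, used only for illustration.
ids = [0, 2, 3]
probabilities = [0.91, 0.08, 0.5]

# Anything above 0.5 counts as a disaster tweet; labels are plain integers.
labels = [1 if p > 0.5 else 0 for p in probabilities]

# Write to an in-memory buffer instead of a file so the layout can be inspected.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "target"])
writer.writerows(zip(ids, labels))

print(buffer.getvalue())
# → id,target
#   0,1
#   2,0
#   3,0
```

There is no extra index column and the targets are integers, which is why the pandas version should be written with `index=False`.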

In [None]:
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [None]:
submit = pd.DataFrame({
        "id": sample_submission["id"],
        "target":predict.flatten()
    })

In [None]:
submit.target.value_counts()

In [None]:
submit.head()

In [None]:
# index=False keeps the file to the two expected columns, id and target
submit.to_csv("submission.csv", index=False)

Now, in the viewer, you can submit the above file to the competition! Good luck!