# Bag-of-words example

This notebook is a brief introduction to the bag-of-words approach. It is not meant as a guide to building a good natural language processing (NLP) network, and the model built here is not one.

The data used in this example is from Kaggle's Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/overview

In [None]:
import pandas as pd
import numpy as np

tweets = pd.read_csv('train.csv')

tweets[tweets.target == 0].text.to_numpy()

The target is 1 if the tweet is about a disaster and 0 otherwise. We'll try to predict it from the tweet text.

Since this is about NLP, I'll just use the text even though the keyword looks useful.

The CountVectorizer finds all distinct words in the body of text (that is, across all the rows). It returns one vector for each input text, and each entry in that vector counts how many times the corresponding word occurred in the text.

Note the conversions to NumPy arrays. Keras is none too happy with pandas DataFrames.

The shape of _X_ reveals that we have 7613 vectors (texts) and 21637 distinct words.
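
To make the word-count vectors concrete, here is a minimal pure-Python sketch of the same idea on three made-up sentences. It is not scikit-learn's implementation: the helper `bag_of_words` and the toy texts are hypothetical, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen to resemble CountVectorizer's defaults.

```python
import re
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count_matrix) for a list of strings."""
    # Assumed tokenization: lowercase, then keep runs of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in texts]
    # The vocabulary is every distinct word across all texts, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per text, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up example texts, not taken from the Kaggle data.
texts = ["Fire in the forest", "The forest is calm", "Fire fire everywhere"]
vocabulary, matrix = bag_of_words(texts)

print(vocabulary)
for row in matrix:
    print(row)
print((len(matrix), len(vocabulary)))  # (number of texts, number of distinct words)
```

The last line plays the role of the shape: one row per text and one column per distinct word.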

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


y = tweets.target.to_numpy()
X = tweets.text.to_numpy()

X.shape, y.shape

In [None]:
from sklearn.model_selection import train_test_split

X_, X_test, y_, y_test = train_test_split(X, y, train_size=.8, random_state=504)
X_train, X_validate, y_train, y_validate = train_test_split(X_, y_, train_size=.75, random_state=504)
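
The two chained splits work out to roughly 60% training, 20% validation and 20% test data. The sketch below only does the arithmetic for the 7613 tweets; it assumes that `train_test_split` rounds the training size down when `train_size` is a fraction, so treat the exact row counts as an estimate rather than a guarantee.

```python
import math

n_samples = 7613  # number of tweets in train.csv

# First split: 80% is kept for training and validation, the rest is held out for testing.
n_rest = math.floor(0.8 * n_samples)
n_test = n_samples - n_rest

# Second split: 75% of the remainder is used for training, 25% for validation.
n_train = math.floor(0.75 * n_rest)
n_validate = n_rest - n_train

for name, n in [("train", n_train), ("validate", n_validate), ("test", n_test)]:
    print(f"{name:>8}: {n:5d} rows ({n / n_samples:.1%})")
```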

In [None]:
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 5000
embed_dim = 128

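# Map each text to a sequence of integer word indices; the vocabulary is capped at vocab_size entries.
vectorizationLayer = layers.TextVectorization(max_tokens=vocab_size)
# Build the vocabulary from the training texts only, so no held-out words leak in.
vectorizationLayer.adapt(X_train)
vectorizationLayer("It really looks like it will rain")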

In [None]:
ann = tf.keras.Sequential([
    vectorizationLayer,
    layers.Embedding(vocab_size, embed_dim, mask_zero=True, embeddings_regularizer=tf.keras.regularizers.L1(.005)),
    layers.GRU(8),
    layers.Dense(1, activation='sigmoid')
])

In [None]:
ann.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.002), metrics=['accuracy'])

In [None]:
es = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True, start_from_epoch=25)
lr = tf.keras.callbacks.ReduceLROnPlateau(patience=10, factor=.5, min_lr=1e-5)

history = ann.fit(X_train, y_train, epochs = 100, validation_data=(X_validate, y_validate), callbacks=[es, lr])


In [None]:
import matplotlib.pyplot as plt

# Skip the first 5 epochs, but keep the true epoch numbers on the x-axis.
skip = 5
epochs = range(skip, len(history.history['loss']))

figure = plt.figure(figsize=(20, 10))
ax = figure.add_subplot(1, 2, 1, title='Learning curves (loss)')
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")
ax.plot(epochs, history.history['loss'][skip:], label='train')
ax.plot(epochs, history.history['val_loss'][skip:], label='valid')
ax.legend()

ax = figure.add_subplot(1, 2, 2, title='Learning curves (accuracy)')
ax.set_xlabel("Epoch")
ax.set_ylabel("Accuracy")
ax.plot(epochs, history.history['accuracy'][skip:], label='train')
ax.plot(epochs, history.history['val_accuracy'][skip:], label='valid')
ax.legend()

plt.show()

In [None]:
ann.evaluate(X_validate, y_validate)