# Bag-of-words example

This notebook is a brief introduction to the bag-of-words representation. It is not meant as a guide to building a good natural language processing (NLP) model, and the model built here is not one.

The data used in this example is from Kaggle's Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/overview

In [19]:
import pandas as pd
import numpy as np

tweets = pd.read_csv('train.csv')

tweets[tweets.target == 0]

| Unnamed: 0 | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 15 | 23 | | | What's up man? | 0 |
| 16 | 24 | | | I love fruits | 0 |
| 17 | 25 | | | Summer is lovely | 0 |
| 18 | 26 | | | My car is so fast | 0 |
| 19 | 28 | | | What a goooooooaaaaaal!!!!!! | 0 |
| ... | ... | ... | ... | ... | ... |
| 7581 | 10833 | wrecked | Lincoln | @engineshed Great atmosphere at the British Li... | 0 |
| 7582 | 10834 | wrecked | | Cramer: Iger's 3 words that wrecked Disney's s... | 0 |
| 7584 | 10837 | | | These boxes are ready to explode! Exploding Ki... | 0 |
| 7587 | 10841 | | | Sirens everywhere! | 0 |


The target is 1 if the tweet is about a disaster and 0 otherwise. We'll try to predict it from the tweet text.

Since this is about NLP, I'll just use the text even though the keyword looks useful.

CountVectorizer finds all distinct words in the corpus (that is, across all the rows) and returns one vector per input text. Each element of that vector counts how many times the corresponding word occurred in that text.
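To make the word-count idea concrete, here is a minimal sketch in plain Python rather than scikit-learn. The helper names (`tokenize`, `fit_vocabulary`, `transform`) are made up for this illustration, and the tokenizer only approximates what I understand CountVectorizer's defaults to be (lowercasing, tokens of two or more word characters). The three texts are taken from the sample rows above.

```python
import re
from collections import Counter

# Assumption: approximate CountVectorizer's defaults by lowercasing and
# keeping only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(texts):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Return one count vector per text, with one column per vocabulary word."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[tok] for tok in vocabulary])
    return rows

corpus = ["I love fruits", "Summer is lovely", "My car is so fast"]
vocabulary = fit_vocabulary(corpus)
matrix = transform(corpus, vocabulary)

print(list(vocabulary))  # the column order of the count vectors
print(matrix)            # 3 rows (texts) by 9 columns (distinct words)
```

Word order is lost in this representation: each row only records which words appear and how often.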

Note the conversions to NumPy arrays. Keras is none too happy with pandas DataFrames.

Applied to the tweet texts, CountVectorizer produces an _X_ whose shape reveals 7613 vectors (texts) and 21637 distinct words.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer


y = tweets.target.to_numpy()
X = tweets.text.to_numpy()

X.shape, y.shape

((7613,), (7613,))

In [21]:
from sklearn.model_selection import train_test_split

X_, X_test, y_, y_test = train_test_split(X, y, train_size=.8, random_state=504)
X_train, X_validate, y_train, y_validate = train_test_split(X_, y_, train_size=.75, random_state=504)
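The two calls above give roughly a 60/20/20 train/validate/test split, since 0.8 × 0.75 = 0.6. The sketch below works out the row counts by hand. The `split_sizes` helper is hypothetical, and its rounding (train part rounded down, remainder rounded up) is my understanding of how `train_test_split` treats a float `train_size`, so treat the exact numbers as an estimate.

```python
import math

def split_sizes(n_samples, train_size):
    """Sizes of the two parts produced by a float train_size.

    Assumption: the train part is rounded down and the remaining part is
    rounded up, which is how I understand train_test_split to behave.
    """
    n_train = math.floor(train_size * n_samples)
    n_rest = math.ceil((1 - train_size) * n_samples)
    return n_train, n_rest

n_total = 7613  # number of tweets in train.csv, from X.shape above

# First split: 80% for training plus validation, 20% held out for testing.
n_trainval, n_test = split_sizes(n_total, 0.8)

# Second split: 75% of the remainder for training, 25% for validation.
n_train, n_validate = split_sizes(n_trainval, 0.75)

# With the Keras default batch size of 32, this many batches make one epoch.
batches_per_epoch = math.ceil(n_train / 32)

print(n_train, n_validate, n_test, batches_per_epoch)
```

Under these assumptions the training set has 4567 rows, which at batch size 32 gives the 143 steps per epoch seen in the training log.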

In [22]:
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 5000
embed_dim = 128

vectorizationLayer = layers.TextVectorization(max_tokens=vocab_size)
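# Build the vocabulary from the training split only, so that validation
# and test text does not leak into it.
vectorizationLayer.adapt(X_train)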

ann = tf.keras.Sequential([
    vectorizationLayer,
    layers.Embedding(vocab_size, embed_dim, mask_zero=True),
    layers.GRU(128),
    layers.Dense(1, activation='sigmoid')
])

ann.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])
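Unlike a bag-of-words count vector, the `TextVectorization` layer in its default integer mode turns each text into a sequence of word indices, which the `Embedding` and `GRU` layers then consume in order. The plain-Python sketch below imitates that encoding under simplifying assumptions: `build_vocab` and `encode` are hypothetical helpers, punctuation is not stripped, and the example texts are invented. Index 0 is reserved for padding (which is what `mask_zero=True` masks out) and index 1 for out-of-vocabulary words.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved indices: padding and out-of-vocabulary

def build_vocab(texts, max_tokens):
    """Give the most frequent words indices 2, 3, ... up to max_tokens - 1."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(kept)}

def encode(texts, vocab):
    """Replace each word by its index and right-pad to a common length."""
    sequences = [[vocab.get(tok, OOV) for tok in text.lower().split()]
                 for text in texts]
    width = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (width - len(seq)) for seq in sequences]

texts = ["fire fire near school", "fire truck"]  # made-up example texts
vocab = build_vocab(texts, max_tokens=3)         # room for a single real word
encoded = encode(texts, vocab)

print(vocab)
print(encoded)
```

With `max_tokens=3` only the most frequent word gets its own index, and every other word collapses to the out-of-vocabulary index, which is the same trade-off `vocab_size = 5000` makes on the real data.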

In [23]:
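# Stop once the validation loss has not improved for 20 epochs in a row,
# then restore the weights from the best epoch.
es = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

ann.fit(X_train, y_train, epochs=100, validation_data=(X_validate, y_validate), callbacks=[es])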


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
  3/143 [..............................] - ETA: 5s - loss: 0.0322 - accuracy: 0.9792

KeyboardInterrupt: 
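The run above was interrupted by hand before `EarlyStopping` could trigger. As a rough sketch of what the callback would have done, the function below applies a patience rule to a list of validation losses. It is a simplified stand-in (hypothetical `early_stopping` helper, invented loss values, no `min_delta`), not the Keras implementation.

```python
def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once `patience` epochs in a row fail to improve on the
    best loss seen so far. The best epoch is the one whose weights
    restore_best_weights=True would bring back.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through every epoch without exhausting the patience.
    return len(val_losses) - 1, best_epoch

# Made-up validation losses, for illustration only.
losses = [0.60, 0.50, 0.55, 0.52, 0.58]
stop_epoch, best_epoch = early_stopping(losses, patience=2)

print(stop_epoch, best_epoch)
```

With `patience=20`, as in the notebook, training continues for 20 epochs past the best validation loss before stopping.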